• Intro
  • Data Information
    • Handling Missing data
    • Impact of Imputation
  • Data Questions
    • Target Variable Information
    • Demographic Analysis:
  • Conclusion

Intro

  1. Business Problem

• Description: The business problem being addressed is how to credit-check and qualify a candidate for a loan when they have minimal or no credit history. Currently, there is no mechanism for people who haven’t had the time or opportunity to build credit through a credit card. Home Credit is trying to find a way to serve this underrepresented segment of the economy while also protecting itself from overly risky loans.

• Impact Analysis: The impact of this analysis is substantial. Identifying good candidates helps both the company and its customers. For the company, it makes lending to this population sustainable without absorbing excessive default risk; defaults are especially costly here because there is no straightforward way to recoup the investment. For customers, it means borrowing from a financially solvent lender, so they are not pushed toward higher interest rates and sub-standard creditors.

Data Information

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.2
application_test <- read.csv("application_test.csv")
application_train <- read.csv("application_train.csv")

skim(application_train)
Data summary
Name application_train
Number of rows 307511
Number of columns 122
_______________________
Column type frequency:
character 16
numeric 106
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NAME_CONTRACT_TYPE 0 1 10 15 0 2 0
CODE_GENDER 0 1 1 3 0 3 0
FLAG_OWN_CAR 0 1 1 1 0 2 0
FLAG_OWN_REALTY 0 1 1 1 0 2 0
NAME_TYPE_SUITE 0 1 0 15 1292 8 0
NAME_INCOME_TYPE 0 1 7 20 0 8 0
NAME_EDUCATION_TYPE 0 1 15 29 0 5 0
NAME_FAMILY_STATUS 0 1 5 20 0 6 0
NAME_HOUSING_TYPE 0 1 12 19 0 6 0
OCCUPATION_TYPE 0 1 0 21 96391 19 0
WEEKDAY_APPR_PROCESS_START 0 1 6 9 0 7 0
ORGANIZATION_TYPE 0 1 3 22 0 58 0
FONDKAPREMONT_MODE 0 1 0 21 210295 5 0
HOUSETYPE_MODE 0 1 0 16 154297 4 0
WALLSMATERIAL_MODE 0 1 0 12 156341 8 0
EMERGENCYSTATE_MODE 0 1 0 3 145755 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SK_ID_CURR 0 1.00 278180.52 102790.18 100002.00 189145.50 278202.00 367142.50 456255.00 ▇▇▇▇▇
TARGET 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
CNT_CHILDREN 0 1.00 0.42 0.72 0.00 0.00 0.00 1.00 19.00 ▇▁▁▁▁
AMT_INCOME_TOTAL 0 1.00 168797.92 237123.15 25650.00 112500.00 147150.00 202500.00 117000000.00 ▇▁▁▁▁
AMT_CREDIT 0 1.00 599026.00 402490.78 45000.00 270000.00 513531.00 808650.00 4050000.00 ▇▂▁▁▁
AMT_ANNUITY 12 1.00 27108.57 14493.74 1615.50 16524.00 24903.00 34596.00 258025.50 ▇▁▁▁▁
AMT_GOODS_PRICE 278 1.00 538396.21 369446.46 40500.00 238500.00 450000.00 679500.00 4050000.00 ▇▂▁▁▁
REGION_POPULATION_RELATIVE 0 1.00 0.02 0.01 0.00 0.01 0.02 0.03 0.07 ▇▇▃▁▁
DAYS_BIRTH 0 1.00 -16037.00 4363.99 -25229.00 -19682.00 -15750.00 -12413.00 -7489.00 ▃▆▇▇▅
DAYS_EMPLOYED 0 1.00 63815.05 141275.77 -17912.00 -2760.00 -1213.00 -289.00 365243.00 ▇▁▁▁▂
DAYS_REGISTRATION 0 1.00 -4986.12 3522.89 -24672.00 -7479.50 -4504.00 -2010.00 0.00 ▁▁▂▅▇
DAYS_ID_PUBLISH 0 1.00 -2994.20 1509.45 -7197.00 -4299.00 -3254.00 -1720.00 0.00 ▁▆▇▆▅
OWN_CAR_AGE 202929 0.34 12.06 11.94 0.00 5.00 9.00 15.00 91.00 ▇▁▁▁▁
FLAG_MOBIL 0 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
FLAG_EMP_PHONE 0 1.00 0.82 0.38 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
FLAG_WORK_PHONE 0 1.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
FLAG_CONT_MOBILE 0 1.00 1.00 0.04 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
FLAG_PHONE 0 1.00 0.28 0.45 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃
FLAG_EMAIL 0 1.00 0.06 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
CNT_FAM_MEMBERS 2 1.00 2.15 0.91 1.00 2.00 2.00 3.00 20.00 ▇▁▁▁▁
REGION_RATING_CLIENT 0 1.00 2.05 0.51 1.00 2.00 2.00 2.00 3.00 ▁▁▇▁▂
REGION_RATING_CLIENT_W_CITY 0 1.00 2.03 0.50 1.00 2.00 2.00 2.00 3.00 ▁▁▇▁▂
HOUR_APPR_PROCESS_START 0 1.00 12.06 3.27 0.00 10.00 12.00 14.00 23.00 ▁▃▇▆▁
REG_REGION_NOT_LIVE_REGION 0 1.00 0.02 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_REGION_NOT_WORK_REGION 0 1.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
LIVE_REGION_NOT_WORK_REGION 0 1.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_CITY_NOT_LIVE_CITY 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_CITY_NOT_WORK_CITY 0 1.00 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
LIVE_CITY_NOT_WORK_CITY 0 1.00 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
EXT_SOURCE_1 173378 0.44 0.50 0.21 0.01 0.33 0.51 0.68 0.96 ▂▇▇▇▃
EXT_SOURCE_2 660 1.00 0.51 0.19 0.00 0.39 0.57 0.66 0.85 ▂▂▃▇▃
EXT_SOURCE_3 60965 0.80 0.51 0.19 0.00 0.37 0.54 0.67 0.90 ▂▃▆▇▃
APARTMENTS_AVG 156061 0.49 0.12 0.11 0.00 0.06 0.09 0.15 1.00 ▇▁▁▁▁
BASEMENTAREA_AVG 179943 0.41 0.09 0.08 0.00 0.04 0.08 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_AVG 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_AVG 204488 0.34 0.75 0.11 0.00 0.69 0.76 0.82 1.00 ▁▁▁▇▅
COMMONAREA_AVG 214865 0.30 0.04 0.08 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_AVG 163891 0.47 0.08 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_AVG 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▃▁▁▁
FLOORSMAX_AVG 153020 0.50 0.23 0.14 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_AVG 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_AVG 182590 0.41 0.07 0.08 0.00 0.02 0.05 0.09 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_AVG 210199 0.32 0.10 0.09 0.00 0.05 0.08 0.12 1.00 ▇▁▁▁▁
LIVINGAREA_AVG 154350 0.50 0.11 0.11 0.00 0.05 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_AVG 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_AVG 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.03 1.00 ▇▁▁▁▁
APARTMENTS_MODE 156061 0.49 0.11 0.11 0.00 0.05 0.08 0.14 1.00 ▇▁▁▁▁
BASEMENTAREA_MODE 179943 0.41 0.09 0.08 0.00 0.04 0.07 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_MODE 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_MODE 204488 0.34 0.76 0.11 0.00 0.70 0.76 0.82 1.00 ▁▁▁▇▅
COMMONAREA_MODE 214865 0.30 0.04 0.07 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_MODE 163891 0.47 0.07 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_MODE 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▂▁▁▁
FLOORSMAX_MODE 153020 0.50 0.22 0.14 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_MODE 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_MODE 182590 0.41 0.06 0.08 0.00 0.02 0.05 0.08 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_MODE 210199 0.32 0.11 0.10 0.00 0.05 0.08 0.13 1.00 ▇▁▁▁▁
LIVINGAREA_MODE 154350 0.50 0.11 0.11 0.00 0.04 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_MODE 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_MODE 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.02 1.00 ▇▁▁▁▁
APARTMENTS_MEDI 156061 0.49 0.12 0.11 0.00 0.06 0.09 0.15 1.00 ▇▁▁▁▁
BASEMENTAREA_MEDI 179943 0.41 0.09 0.08 0.00 0.04 0.08 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_MEDI 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_MEDI 204488 0.34 0.76 0.11 0.00 0.69 0.76 0.83 1.00 ▁▁▁▇▅
COMMONAREA_MEDI 214865 0.30 0.04 0.08 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_MEDI 163891 0.47 0.08 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_MEDI 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▃▁▁▁
FLOORSMAX_MEDI 153020 0.50 0.23 0.15 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_MEDI 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_MEDI 182590 0.41 0.07 0.08 0.00 0.02 0.05 0.09 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_MEDI 210199 0.32 0.10 0.09 0.00 0.05 0.08 0.12 1.00 ▇▁▁▁▁
LIVINGAREA_MEDI 154350 0.50 0.11 0.11 0.00 0.05 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_MEDI 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_MEDI 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.03 1.00 ▇▁▁▁▁
TOTALAREA_MODE 148431 0.52 0.10 0.11 0.00 0.04 0.07 0.13 1.00 ▇▁▁▁▁
OBS_30_CNT_SOCIAL_CIRCLE 1021 1.00 1.42 2.40 0.00 0.00 0.00 2.00 348.00 ▇▁▁▁▁
DEF_30_CNT_SOCIAL_CIRCLE 1021 1.00 0.14 0.45 0.00 0.00 0.00 0.00 34.00 ▇▁▁▁▁
OBS_60_CNT_SOCIAL_CIRCLE 1021 1.00 1.41 2.38 0.00 0.00 0.00 2.00 344.00 ▇▁▁▁▁
DEF_60_CNT_SOCIAL_CIRCLE 1021 1.00 0.10 0.36 0.00 0.00 0.00 0.00 24.00 ▇▁▁▁▁
DAYS_LAST_PHONE_CHANGE 1 1.00 -962.86 826.81 -4292.00 -1570.00 -757.00 -274.00 0.00 ▁▁▂▃▇
FLAG_DOCUMENT_2 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_3 0 1.00 0.71 0.45 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
FLAG_DOCUMENT_4 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_5 0 1.00 0.02 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_6 0 1.00 0.09 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_7 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_8 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_9 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_10 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_11 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_12 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_13 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_14 0 1.00 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_15 0 1.00 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_16 0 1.00 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_17 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_18 0 1.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_19 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_20 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_21 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.86 0.01 0.08 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.86 0.01 0.11 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.86 0.03 0.20 0.00 0.00 0.00 0.00 8.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_MON 41519 0.86 0.27 0.92 0.00 0.00 0.00 0.00 27.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.86 0.27 0.79 0.00 0.00 0.00 0.00 261.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.86 1.90 1.87 0.00 0.00 1.00 3.00 25.00 ▇▁▁▁▁

Handling Missing data

Amount Required Credit Bureau with time distinction

I want to combine some similar sections of the data to see if there is a grouped way of handling the missing data.

# select the six AMT_REQ columns and preview the first 15 rows
amt_req <- application_train %>%
  select(AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_MON, 
         AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_YEAR) %>%
  head(15)

# list unique values
list_unique_values <- list(
  AMT_REQ_CREDIT_BUREAU_DAY = unique(application_train$AMT_REQ_CREDIT_BUREAU_DAY),
  AMT_REQ_CREDIT_BUREAU_HOUR = unique(application_train$AMT_REQ_CREDIT_BUREAU_HOUR),
  AMT_REQ_CREDIT_BUREAU_MON = unique(application_train$AMT_REQ_CREDIT_BUREAU_MON),
  AMT_REQ_CREDIT_BUREAU_QRT = unique(application_train$AMT_REQ_CREDIT_BUREAU_QRT),
  AMT_REQ_CREDIT_BUREAU_WEEK = unique(application_train$AMT_REQ_CREDIT_BUREAU_WEEK),
  AMT_REQ_CREDIT_BUREAU_YEAR = unique(application_train$AMT_REQ_CREDIT_BUREAU_YEAR)
)

list_unique_values
## $AMT_REQ_CREDIT_BUREAU_DAY
##  [1]  0 NA  1  3  2  4  5  6  9  8
## 
## $AMT_REQ_CREDIT_BUREAU_HOUR
## [1]  0 NA  1  2  3  4
## 
## $AMT_REQ_CREDIT_BUREAU_MON
##  [1]  0 NA  1  2  6  5  3  7  9  4 11  8 16 12 14 10 13 17 24 19 15 23 18 27 22
## 
## $AMT_REQ_CREDIT_BUREAU_QRT
##  [1]   0  NA   1   2   4   3   8   5   6   7 261  19
## 
## $AMT_REQ_CREDIT_BUREAU_WEEK
##  [1]  0 NA  1  3  2  4  5  6  8  7
## 
## $AMT_REQ_CREDIT_BUREAU_YEAR
##  [1]  1  0 NA  2  4  5  3  8  6  9  7 10 11 13 16 12 25 23 15 14 22 17 19 18 21
## [26] 20

This column is described as the “Number of inquiries to the Credit Bureau about the client one {timeframe} before the application.” Given the significant number of NA values, alongside the valuable information contained within the non-NA values, it’s crucial to carefully consider how to handle this data. The presence of NA values raises questions, yet the data without NAs remains informative. It seems that the NAs might not be inherently informative, and straightforward imputation using the median or mean might not necessarily be the right answer.

Several theories could explain the occurrence of NAs:

  • Lack of Permission or Access: The applicant may not have authorized the lender to make these inquiries, or the lender might not have initiated an inquiry during that period.

  • Data Collection Issues: Problems in the data collection or reporting process could lead to missing values.

  • Applicability: This type of credit inquiry might not have been applicable or required for some applicants, resulting in fields being left blank or marked as NA.

  • Technical Errors: Issues in data processing, transfer, or storage could result in missing values.

Observing the data reveals 41,519 NAs across these columns, suggesting a substantial amount of missing information. However, the uniformity in the number of NAs across different columns suggests it might not be coincidental, indicating it could be the same individuals who either did not respond or were unable to respond. This pattern hints at potential issues with permission or applicability, where inquiries were possibly made selectively based on specific criteria. Given the considerable number of responses and the apparent non-randomness of the NAs, it may be prudent to exclude these missing values from certain analyses. Imputing these missing values could introduce bias, especially if the NAs result from systematic differences in the applicability of inquiries or permission granted by the applicants.
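That "same individuals" theory can be checked directly: if the 41,519 NAs belong to the same applicants, then the rows with any missing AMT_REQ value should be exactly the rows where all six are missing. A quick sketch, assuming `application_train` is loaded as above:

```r
# columns sharing the AMT_REQ_CREDIT_BUREAU_* prefix
amt_req_cols <- application_train %>%
  select(starts_with("AMT_REQ_CREDIT_BUREAU"))

# rows where every one of the six columns is NA at once
n_all_na <- sum(rowSums(is.na(amt_req_cols)) == ncol(amt_req_cols))
# rows where at least one of the six columns is NA
n_any_na <- sum(rowSums(is.na(amt_req_cols)) > 0)

# if the NAs come from the same applicants, both counts should equal 41,519
c(all_na = n_all_na, any_na = n_any_na)
```

If the two counts match, the missingness is row-level rather than scattered, which supports excluding those rows rather than imputing.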

Impact of Imputation

Descriptive Stat Columns

There is a difference in the descriptive-stat columns, such as those reporting the median, mode, and average. With these stats, there doesn’t seem to be a shared NA count connecting them like the previous category. These will take more thought than simply removing them from the dataset.

Let’s chart them and see what happens.

-AVG:

# Select columns 
avg <- application_train %>%
  select(APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG,
         YEARS_BUILD_AVG, COMMONAREA_AVG, ELEVATORS_AVG, ENTRANCES_AVG,
         FLOORSMAX_AVG, FLOORSMIN_AVG, LANDAREA_AVG)

# Reshape for faceting
data_long <- pivot_longer(avg, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot the distributions 
p <- ggplot(data_long, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x") +
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Average Variables")

# Print the plot
print(p)

Most of these distributions are skewed. Given the skew, imputing the median makes more sense than the mean, since the mean is sensitive to the larger values. And because there is no clear explanation for why these values are NA, it makes sense to impute rather than blindly remove them from the dataset.
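A minimal sketch of that median imputation for the _AVG columns (using the tidyverse `mutate(across(...))` idiom; this assumes `application_train` is loaded as above):

```r
# replace each *_AVG column's NAs with that column's own median
avg_imputed <- application_train %>%
  mutate(across(ends_with("_AVG"),
                ~ if_else(is.na(.x), median(.x, na.rm = TRUE), .x)))

# sanity check: no NAs should remain in the imputed columns
sum(is.na(select(avg_imputed, ends_with("_AVG"))))
```

The same pattern would apply to the _MEDI and numeric _MODE columns with a different suffix in `ends_with()`.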

-Median:

# Select columns 
medi <- application_train %>%
  select(APARTMENTS_MEDI, BASEMENTAREA_MEDI, YEARS_BEGINEXPLUATATION_MEDI,
         YEARS_BUILD_MEDI, COMMONAREA_MEDI, ELEVATORS_MEDI, ENTRANCES_MEDI,
         FLOORSMAX_MEDI, FLOORSMIN_MEDI, LANDAREA_MEDI,
         LIVINGAPARTMENTS_MEDI, LIVINGAREA_MEDI, NONLIVINGAPARTMENTS_MEDI,
         NONLIVINGAREA_MEDI)

# Reshape for faceting
data_long_medi <- pivot_longer(medi, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot distributions 
p_medi <- ggplot(data_long_medi, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) + 
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Median Variables")

# Print the plot
print(p_medi)

The medians tell the same story, with the possible exception of YEARS_BUILD_MEDI.

-Mode

character_columns <- names(application_train)[sapply(application_train, is.character)]
print(character_columns)
##  [1] "NAME_CONTRACT_TYPE"         "CODE_GENDER"               
##  [3] "FLAG_OWN_CAR"               "FLAG_OWN_REALTY"           
##  [5] "NAME_TYPE_SUITE"            "NAME_INCOME_TYPE"          
##  [7] "NAME_EDUCATION_TYPE"        "NAME_FAMILY_STATUS"        
##  [9] "NAME_HOUSING_TYPE"          "OCCUPATION_TYPE"           
## [11] "WEEKDAY_APPR_PROCESS_START" "ORGANIZATION_TYPE"         
## [13] "FONDKAPREMONT_MODE"         "HOUSETYPE_MODE"            
## [15] "WALLSMATERIAL_MODE"         "EMERGENCYSTATE_MODE"
head(application_train$WALLSMATERIAL_MODE)
## [1] "Stone, brick" "Block"        ""             ""             ""            
## [6] ""
# Select columns 
mode <- application_train %>%
  select(APARTMENTS_MODE, BASEMENTAREA_MODE, YEARS_BEGINEXPLUATATION_MODE,
         YEARS_BUILD_MODE, COMMONAREA_MODE, ELEVATORS_MODE, ENTRANCES_MODE,
         FLOORSMAX_MODE, FLOORSMIN_MODE, LANDAREA_MODE,
         LIVINGAPARTMENTS_MODE, LIVINGAREA_MODE, NONLIVINGAPARTMENTS_MODE,
         NONLIVINGAREA_MODE, TOTALAREA_MODE)

# Reshape for faceting
data_long_mode <- pivot_longer(mode, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot the distributions 
p_mode <- ggplot(data_long_mode, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) + 
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Mode Variables")

# Print the plot
print(p_mode)

Setting aside the character-typed _MODE columns, which I do not expect to use, these follow the same pattern once again. I am leaning toward median over mean for imputation.

Data Questions

Target Variable Information

  1. Let’s take a look at the TARGET variable and see its distribution

First, let’s graph it.

# compute percentages
percentage_data <- application_train %>%
  count(TARGET) %>% # count of occurrences 
  mutate(Percentage = n / sum(n) * 100) # percentage

# bar chart
percentage_data %>%
  ggplot(aes(x = as.factor(TARGET), y = Percentage, fill = as.factor(TARGET))) +
  geom_bar(stat = "identity") +
  labs(x = "Target", y = "Percentage", fill = "Target") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # values are already 0-100
  geom_text(aes(label = sprintf("%.2f%%", Percentage)), vjust = -0.5, size = 3.5) +
  theme_minimal() +
  theme(legend.position = "none") 

TARGET = 1 marks clients with payment difficulties, including late payments beyond a threshold number of days; 0 covers all other cases. This shows that the majority of applicants actually have reasonably sound financial backing. There is an imbalance in the target variable, which is good for customers and the company alike, but not ideal for the model. The imbalance means any predictive model built on this target will need to be carefully calibrated; otherwise, it will be biased toward predicting the majority class. We might need to oversample the minority class (1) or undersample the majority class (0) to compensate. This is not all bad, though: the clear differentiation between the two classes suggests the target variable is well-defined and likely to be a reliable label for modeling purposes.
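As one rough illustration of the undersampling idea (a sketch only; in practice a package such as themis or ROSE would handle this more carefully), the majority class can be downsampled to the minority class size with plain dplyr:

```r
set.seed(42) # make the random sample reproducible

minority <- application_train %>% filter(TARGET == 1)
majority <- application_train %>% filter(TARGET == 0)

# downsample the majority class to match the minority class size
balanced <- bind_rows(
  minority,
  slice_sample(majority, n = nrow(minority))
)

table(balanced$TARGET) # both classes now have equal counts
```

Oversampling the minority class, or weighting the loss in the model itself, are the alternatives when discarding majority-class rows is too wasteful.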

Demographic Analysis:

Are there demographic factors (age, gender, income level, etc.) that are associated with a higher risk of default?

Age

# convert DAYS_BIRTH (days before application, negative) to age in years
application_train$AGE_YEARS <- -application_train$DAYS_BIRTH / 365

# Plot age vs target
ggplot(application_train, aes(x = AGE_YEARS, fill = as.factor(TARGET))) +
  geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
  labs(x = "Age (years)", y = "Count", fill = "Target") +
  ggtitle("Age Distribution by Target") +
  theme_minimal()

The histogram indicates that the majority of loan applicants are in their late 20s to early 40s. There is a noticeable decline in the number of applicants as age increases, particularly beyond 50 years. The overlay of the two histograms shows that defaults (Target 1) occur across all ages but are more concentrated in the younger age brackets. The peak for defaults appears to be among the younger applicants, with the proportion of defaults gradually decreasing as age increases.
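To quantify that impression rather than eyeball it (a quick check, assuming the AGE_YEARS column created above), we can compute the default rate within ten-year age bins:

```r
# default rate per 10-year age bucket
application_train %>%
  mutate(age_bin = cut(AGE_YEARS, breaks = seq(20, 70, by = 10))) %>%
  group_by(age_bin) %>%
  summarise(n = n(), default_rate = mean(TARGET)) %>%
  arrange(age_bin)
```

If the histogram reading is right, default_rate should decline monotonically across the buckets.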

Income Level

# plot the income level against the target 
ggplot(application_train, aes(x = AMT_INCOME_TOTAL, fill = as.factor(TARGET))) +
  geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
  labs(x = "Total Income", y = "Count", fill = "Target") +
  ggtitle("Income Distribution by Target") +
  theme_minimal()

This is really being affected by large outliers. A cluster of extreme values makes the chart unreadable and tells us nothing. For the sake of visualization and a deeper understanding, let’s try a logarithmic scale.

ggplot(application_train, aes(x = log1p(AMT_INCOME_TOTAL), fill = as.factor(TARGET))) +
  geom_histogram(position = 'identity', alpha = 0.5, bins = 100) +
  labs(x = "log1p(Total Income)", y = "Count", fill = "Target") +
  ggtitle("Income Distribution by Target (Log Scale)") +
  theme_minimal()

This looks much easier to read. The defaults appear concentrated toward the lower end of the income spectrum. That makes conceptual sense: wealthier applicants are more likely to have at least enough credit history to go through a standard lender, and they certainly have the means to repay their loans.
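To back that reading with numbers (a sketch, again assuming `application_train` as loaded above), the default rate can be compared across income quintiles:

```r
# default rate by income quintile (1 = lowest income, 5 = highest)
application_train %>%
  mutate(income_quintile = ntile(AMT_INCOME_TOTAL, 5)) %>%
  group_by(income_quintile) %>%
  summarise(n = n(), default_rate = mean(TARGET))
```

A higher default_rate in quintile 1 than quintile 5 would confirm the visual impression from the log-scale histogram.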

Conclusion

This exploratory data analysis suggests we should be able to build a decent model for candidates with limited credit histories. The investigation revealed key insights, particularly in demographic factors such as age and income and their relationship with loan repayment difficulties. There is a clearly higher observed risk of default among younger applicants, and income emerged as a critical factor, with lower-income applicants more likely to default. These findings align with Home Credit’s business objective of empowering financially underrepresented individuals while mitigating loan default risk. The pronounced class imbalance in the target variable highlights the need for careful modeling techniques to ensure fair and accurate predictions. Moving forward, I recommend strategies such as re-sampling or model re-weighting to address the imbalance. Further analysis could also incorporate external economic factors to refine predictions and support Home Credit in achieving financial inclusion without compromising the integrity of the loan process.