• Description: This business problem that is being addressed is how to credit check and qualify a candidate for a loan with minimal or no credit history. Currently, there is no mechanism for people who haven’t had the time or opportunity to build credit with the use of credit card. However, Home Credit is trying to find a way for these underrepresented lost population of members of the economy to be counted, while also protecting themselves from too risky loans.
• Impact Analysis: The impact of this analysis is very large. Finding good candidates can not only help the company but the customers as well. As far as the company is concerned, it will allow them to stay afloat without having to risk defaults. Defaults would be a curse because there would be no way for the company to recoup their investments. On the customer side, they have confirmation that their loaned money will be in good hands. They know that the company is going to stay financially solvent so they do not have to deal with higher interest rates and sub-standard creditors.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.2
application_test <- read.csv("application_test.csv")
application_train <- read.csv("application_train.csv")
skim(application_train)
Name | application_train |
Number of rows | 307511 |
Number of columns | 122 |
_______________________ | |
Column type frequency: | |
character | 16 |
numeric | 106 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
NAME_CONTRACT_TYPE | 0 | 1 | 10 | 15 | 0 | 2 | 0 |
CODE_GENDER | 0 | 1 | 1 | 3 | 0 | 3 | 0 |
FLAG_OWN_CAR | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
FLAG_OWN_REALTY | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
NAME_TYPE_SUITE | 0 | 1 | 0 | 15 | 1292 | 8 | 0 |
NAME_INCOME_TYPE | 0 | 1 | 7 | 20 | 0 | 8 | 0 |
NAME_EDUCATION_TYPE | 0 | 1 | 15 | 29 | 0 | 5 | 0 |
NAME_FAMILY_STATUS | 0 | 1 | 5 | 20 | 0 | 6 | 0 |
NAME_HOUSING_TYPE | 0 | 1 | 12 | 19 | 0 | 6 | 0 |
OCCUPATION_TYPE | 0 | 1 | 0 | 21 | 96391 | 19 | 0 |
WEEKDAY_APPR_PROCESS_START | 0 | 1 | 6 | 9 | 0 | 7 | 0 |
ORGANIZATION_TYPE | 0 | 1 | 3 | 22 | 0 | 58 | 0 |
FONDKAPREMONT_MODE | 0 | 1 | 0 | 21 | 210295 | 5 | 0 |
HOUSETYPE_MODE | 0 | 1 | 0 | 16 | 154297 | 4 | 0 |
WALLSMATERIAL_MODE | 0 | 1 | 0 | 12 | 156341 | 8 | 0 |
EMERGENCYSTATE_MODE | 0 | 1 | 0 | 3 | 145755 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SK_ID_CURR | 0 | 1.00 | 278180.52 | 102790.18 | 100002.00 | 189145.50 | 278202.00 | 367142.50 | 456255.00 | ▇▇▇▇▇ |
TARGET | 0 | 1.00 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
CNT_CHILDREN | 0 | 1.00 | 0.42 | 0.72 | 0.00 | 0.00 | 0.00 | 1.00 | 19.00 | ▇▁▁▁▁ |
AMT_INCOME_TOTAL | 0 | 1.00 | 168797.92 | 237123.15 | 25650.00 | 112500.00 | 147150.00 | 202500.00 | 117000000.00 | ▇▁▁▁▁ |
AMT_CREDIT | 0 | 1.00 | 599026.00 | 402490.78 | 45000.00 | 270000.00 | 513531.00 | 808650.00 | 4050000.00 | ▇▂▁▁▁ |
AMT_ANNUITY | 12 | 1.00 | 27108.57 | 14493.74 | 1615.50 | 16524.00 | 24903.00 | 34596.00 | 258025.50 | ▇▁▁▁▁ |
AMT_GOODS_PRICE | 278 | 1.00 | 538396.21 | 369446.46 | 40500.00 | 238500.00 | 450000.00 | 679500.00 | 4050000.00 | ▇▂▁▁▁ |
REGION_POPULATION_RELATIVE | 0 | 1.00 | 0.02 | 0.01 | 0.00 | 0.01 | 0.02 | 0.03 | 0.07 | ▇▇▃▁▁ |
DAYS_BIRTH | 0 | 1.00 | -16037.00 | 4363.99 | -25229.00 | -19682.00 | -15750.00 | -12413.00 | -7489.00 | ▃▆▇▇▅ |
DAYS_EMPLOYED | 0 | 1.00 | 63815.05 | 141275.77 | -17912.00 | -2760.00 | -1213.00 | -289.00 | 365243.00 | ▇▁▁▁▂ |
DAYS_REGISTRATION | 0 | 1.00 | -4986.12 | 3522.89 | -24672.00 | -7479.50 | -4504.00 | -2010.00 | 0.00 | ▁▁▂▅▇ |
DAYS_ID_PUBLISH | 0 | 1.00 | -2994.20 | 1509.45 | -7197.00 | -4299.00 | -3254.00 | -1720.00 | 0.00 | ▁▆▇▆▅ |
OWN_CAR_AGE | 202929 | 0.34 | 12.06 | 11.94 | 0.00 | 5.00 | 9.00 | 15.00 | 91.00 | ▇▁▁▁▁ |
FLAG_MOBIL | 0 | 1.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
FLAG_EMP_PHONE | 0 | 1.00 | 0.82 | 0.38 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
FLAG_WORK_PHONE | 0 | 1.00 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
FLAG_CONT_MOBILE | 0 | 1.00 | 1.00 | 0.04 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
FLAG_PHONE | 0 | 1.00 | 0.28 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
FLAG_EMAIL | 0 | 1.00 | 0.06 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
CNT_FAM_MEMBERS | 2 | 1.00 | 2.15 | 0.91 | 1.00 | 2.00 | 2.00 | 3.00 | 20.00 | ▇▁▁▁▁ |
REGION_RATING_CLIENT | 0 | 1.00 | 2.05 | 0.51 | 1.00 | 2.00 | 2.00 | 2.00 | 3.00 | ▁▁▇▁▂ |
REGION_RATING_CLIENT_W_CITY | 0 | 1.00 | 2.03 | 0.50 | 1.00 | 2.00 | 2.00 | 2.00 | 3.00 | ▁▁▇▁▂ |
HOUR_APPR_PROCESS_START | 0 | 1.00 | 12.06 | 3.27 | 0.00 | 10.00 | 12.00 | 14.00 | 23.00 | ▁▃▇▆▁ |
REG_REGION_NOT_LIVE_REGION | 0 | 1.00 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
REG_REGION_NOT_WORK_REGION | 0 | 1.00 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
LIVE_REGION_NOT_WORK_REGION | 0 | 1.00 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
REG_CITY_NOT_LIVE_CITY | 0 | 1.00 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
REG_CITY_NOT_WORK_CITY | 0 | 1.00 | 0.23 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
LIVE_CITY_NOT_WORK_CITY | 0 | 1.00 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
EXT_SOURCE_1 | 173378 | 0.44 | 0.50 | 0.21 | 0.01 | 0.33 | 0.51 | 0.68 | 0.96 | ▂▇▇▇▃ |
EXT_SOURCE_2 | 660 | 1.00 | 0.51 | 0.19 | 0.00 | 0.39 | 0.57 | 0.66 | 0.85 | ▂▂▃▇▃ |
EXT_SOURCE_3 | 60965 | 0.80 | 0.51 | 0.19 | 0.00 | 0.37 | 0.54 | 0.67 | 0.90 | ▂▃▆▇▃ |
APARTMENTS_AVG | 156061 | 0.49 | 0.12 | 0.11 | 0.00 | 0.06 | 0.09 | 0.15 | 1.00 | ▇▁▁▁▁ |
BASEMENTAREA_AVG | 179943 | 0.41 | 0.09 | 0.08 | 0.00 | 0.04 | 0.08 | 0.11 | 1.00 | ▇▁▁▁▁ |
YEARS_BEGINEXPLUATATION_AVG | 150007 | 0.51 | 0.98 | 0.06 | 0.00 | 0.98 | 0.98 | 0.99 | 1.00 | ▁▁▁▁▇ |
YEARS_BUILD_AVG | 204488 | 0.34 | 0.75 | 0.11 | 0.00 | 0.69 | 0.76 | 0.82 | 1.00 | ▁▁▁▇▅ |
COMMONAREA_AVG | 214865 | 0.30 | 0.04 | 0.08 | 0.00 | 0.01 | 0.02 | 0.05 | 1.00 | ▇▁▁▁▁ |
ELEVATORS_AVG | 163891 | 0.47 | 0.08 | 0.13 | 0.00 | 0.00 | 0.00 | 0.12 | 1.00 | ▇▁▁▁▁ |
ENTRANCES_AVG | 154828 | 0.50 | 0.15 | 0.10 | 0.00 | 0.07 | 0.14 | 0.21 | 1.00 | ▇▃▁▁▁ |
FLOORSMAX_AVG | 153020 | 0.50 | 0.23 | 0.14 | 0.00 | 0.17 | 0.17 | 0.33 | 1.00 | ▇▃▁▁▁ |
FLOORSMIN_AVG | 208642 | 0.32 | 0.23 | 0.16 | 0.00 | 0.08 | 0.21 | 0.38 | 1.00 | ▅▇▁▁▁ |
LANDAREA_AVG | 182590 | 0.41 | 0.07 | 0.08 | 0.00 | 0.02 | 0.05 | 0.09 | 1.00 | ▇▁▁▁▁ |
LIVINGAPARTMENTS_AVG | 210199 | 0.32 | 0.10 | 0.09 | 0.00 | 0.05 | 0.08 | 0.12 | 1.00 | ▇▁▁▁▁ |
LIVINGAREA_AVG | 154350 | 0.50 | 0.11 | 0.11 | 0.00 | 0.05 | 0.07 | 0.13 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAPARTMENTS_AVG | 213514 | 0.31 | 0.01 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAREA_AVG | 169682 | 0.45 | 0.03 | 0.07 | 0.00 | 0.00 | 0.00 | 0.03 | 1.00 | ▇▁▁▁▁ |
APARTMENTS_MODE | 156061 | 0.49 | 0.11 | 0.11 | 0.00 | 0.05 | 0.08 | 0.14 | 1.00 | ▇▁▁▁▁ |
BASEMENTAREA_MODE | 179943 | 0.41 | 0.09 | 0.08 | 0.00 | 0.04 | 0.07 | 0.11 | 1.00 | ▇▁▁▁▁ |
YEARS_BEGINEXPLUATATION_MODE | 150007 | 0.51 | 0.98 | 0.06 | 0.00 | 0.98 | 0.98 | 0.99 | 1.00 | ▁▁▁▁▇ |
YEARS_BUILD_MODE | 204488 | 0.34 | 0.76 | 0.11 | 0.00 | 0.70 | 0.76 | 0.82 | 1.00 | ▁▁▁▇▅ |
COMMONAREA_MODE | 214865 | 0.30 | 0.04 | 0.07 | 0.00 | 0.01 | 0.02 | 0.05 | 1.00 | ▇▁▁▁▁ |
ELEVATORS_MODE | 163891 | 0.47 | 0.07 | 0.13 | 0.00 | 0.00 | 0.00 | 0.12 | 1.00 | ▇▁▁▁▁ |
ENTRANCES_MODE | 154828 | 0.50 | 0.15 | 0.10 | 0.00 | 0.07 | 0.14 | 0.21 | 1.00 | ▇▂▁▁▁ |
FLOORSMAX_MODE | 153020 | 0.50 | 0.22 | 0.14 | 0.00 | 0.17 | 0.17 | 0.33 | 1.00 | ▇▃▁▁▁ |
FLOORSMIN_MODE | 208642 | 0.32 | 0.23 | 0.16 | 0.00 | 0.08 | 0.21 | 0.38 | 1.00 | ▅▇▁▁▁ |
LANDAREA_MODE | 182590 | 0.41 | 0.06 | 0.08 | 0.00 | 0.02 | 0.05 | 0.08 | 1.00 | ▇▁▁▁▁ |
LIVINGAPARTMENTS_MODE | 210199 | 0.32 | 0.11 | 0.10 | 0.00 | 0.05 | 0.08 | 0.13 | 1.00 | ▇▁▁▁▁ |
LIVINGAREA_MODE | 154350 | 0.50 | 0.11 | 0.11 | 0.00 | 0.04 | 0.07 | 0.13 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAPARTMENTS_MODE | 213514 | 0.31 | 0.01 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAREA_MODE | 169682 | 0.45 | 0.03 | 0.07 | 0.00 | 0.00 | 0.00 | 0.02 | 1.00 | ▇▁▁▁▁ |
APARTMENTS_MEDI | 156061 | 0.49 | 0.12 | 0.11 | 0.00 | 0.06 | 0.09 | 0.15 | 1.00 | ▇▁▁▁▁ |
BASEMENTAREA_MEDI | 179943 | 0.41 | 0.09 | 0.08 | 0.00 | 0.04 | 0.08 | 0.11 | 1.00 | ▇▁▁▁▁ |
YEARS_BEGINEXPLUATATION_MEDI | 150007 | 0.51 | 0.98 | 0.06 | 0.00 | 0.98 | 0.98 | 0.99 | 1.00 | ▁▁▁▁▇ |
YEARS_BUILD_MEDI | 204488 | 0.34 | 0.76 | 0.11 | 0.00 | 0.69 | 0.76 | 0.83 | 1.00 | ▁▁▁▇▅ |
COMMONAREA_MEDI | 214865 | 0.30 | 0.04 | 0.08 | 0.00 | 0.01 | 0.02 | 0.05 | 1.00 | ▇▁▁▁▁ |
ELEVATORS_MEDI | 163891 | 0.47 | 0.08 | 0.13 | 0.00 | 0.00 | 0.00 | 0.12 | 1.00 | ▇▁▁▁▁ |
ENTRANCES_MEDI | 154828 | 0.50 | 0.15 | 0.10 | 0.00 | 0.07 | 0.14 | 0.21 | 1.00 | ▇▃▁▁▁ |
FLOORSMAX_MEDI | 153020 | 0.50 | 0.23 | 0.15 | 0.00 | 0.17 | 0.17 | 0.33 | 1.00 | ▇▃▁▁▁ |
FLOORSMIN_MEDI | 208642 | 0.32 | 0.23 | 0.16 | 0.00 | 0.08 | 0.21 | 0.38 | 1.00 | ▅▇▁▁▁ |
LANDAREA_MEDI | 182590 | 0.41 | 0.07 | 0.08 | 0.00 | 0.02 | 0.05 | 0.09 | 1.00 | ▇▁▁▁▁ |
LIVINGAPARTMENTS_MEDI | 210199 | 0.32 | 0.10 | 0.09 | 0.00 | 0.05 | 0.08 | 0.12 | 1.00 | ▇▁▁▁▁ |
LIVINGAREA_MEDI | 154350 | 0.50 | 0.11 | 0.11 | 0.00 | 0.05 | 0.07 | 0.13 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAPARTMENTS_MEDI | 213514 | 0.31 | 0.01 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
NONLIVINGAREA_MEDI | 169682 | 0.45 | 0.03 | 0.07 | 0.00 | 0.00 | 0.00 | 0.03 | 1.00 | ▇▁▁▁▁ |
TOTALAREA_MODE | 148431 | 0.52 | 0.10 | 0.11 | 0.00 | 0.04 | 0.07 | 0.13 | 1.00 | ▇▁▁▁▁ |
OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 1.00 | 1.42 | 2.40 | 0.00 | 0.00 | 0.00 | 2.00 | 348.00 | ▇▁▁▁▁ |
DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 1.00 | 0.14 | 0.45 | 0.00 | 0.00 | 0.00 | 0.00 | 34.00 | ▇▁▁▁▁ |
OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 1.00 | 1.41 | 2.38 | 0.00 | 0.00 | 0.00 | 2.00 | 344.00 | ▇▁▁▁▁ |
DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 1.00 | 0.10 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 | 24.00 | ▇▁▁▁▁ |
DAYS_LAST_PHONE_CHANGE | 1 | 1.00 | -962.86 | 826.81 | -4292.00 | -1570.00 | -757.00 | -274.00 | 0.00 | ▁▁▂▃▇ |
FLAG_DOCUMENT_2 | 0 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_3 | 0 | 1.00 | 0.71 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
FLAG_DOCUMENT_4 | 0 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_5 | 0 | 1.00 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_6 | 0 | 1.00 | 0.09 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_7 | 0 | 1.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_8 | 0 | 1.00 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_9 | 0 | 1.00 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_10 | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_11 | 0 | 1.00 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_12 | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_13 | 0 | 1.00 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_14 | 0 | 1.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_15 | 0 | 1.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_16 | 0 | 1.00 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_17 | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_18 | 0 | 1.00 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_19 | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_20 | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
FLAG_DOCUMENT_21 | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 0.86 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 0.86 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 0.86 | 0.03 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_MON | 41519 | 0.86 | 0.27 | 0.92 | 0.00 | 0.00 | 0.00 | 0.00 | 27.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 0.86 | 0.27 | 0.79 | 0.00 | 0.00 | 0.00 | 0.00 | 261.00 | ▇▁▁▁▁ |
AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 0.86 | 1.90 | 1.87 | 0.00 | 0.00 | 1.00 | 3.00 | 25.00 | ▇▁▁▁▁ |
I want to combine some similar sections of the data to see if there are a grouped way of handling the missing data.
# combining all columns then save to a data frame
amt_req <- application_train %>%
select(AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_MON,
AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_YEAR) %>%
head(15)
# list unique values
list_unique_values <- list(
AMT_REQ_CREDIT_BUREAU_DAY = unique(application_train$AMT_REQ_CREDIT_BUREAU_DAY),
AMT_REQ_CREDIT_BUREAU_HOUR = unique(application_train$AMT_REQ_CREDIT_BUREAU_HOUR),
AMT_REQ_CREDIT_BUREAU_MON = unique(application_train$AMT_REQ_CREDIT_BUREAU_MON),
AMT_REQ_CREDIT_BUREAU_QRT = unique(application_train$AMT_REQ_CREDIT_BUREAU_QRT),
AMT_REQ_CREDIT_BUREAU_WEEK = unique(application_train$AMT_REQ_CREDIT_BUREAU_WEEK),
AMT_REQ_CREDIT_BUREAU_YEAR = unique(application_train$AMT_REQ_CREDIT_BUREAU_YEAR)
)
list_unique_values
## $AMT_REQ_CREDIT_BUREAU_DAY
## [1] 0 NA 1 3 2 4 5 6 9 8
##
## $AMT_REQ_CREDIT_BUREAU_HOUR
## [1] 0 NA 1 2 3 4
##
## $AMT_REQ_CREDIT_BUREAU_MON
## [1] 0 NA 1 2 6 5 3 7 9 4 11 8 16 12 14 10 13 17 24 19 15 23 18 27 22
##
## $AMT_REQ_CREDIT_BUREAU_QRT
## [1] 0 NA 1 2 4 3 8 5 6 7 261 19
##
## $AMT_REQ_CREDIT_BUREAU_WEEK
## [1] 0 NA 1 3 2 4 5 6 8 7
##
## $AMT_REQ_CREDIT_BUREAU_YEAR
## [1] 1 0 NA 2 4 5 3 8 6 9 7 10 11 13 16 12 25 23 15 14 22 17 19 18 21
## [26] 20
This column is described as the “Number of inquiries to the Credit Bureau about the client one {timeframe} before the application.” Given the significant number of NA values, alongside the valuable information contained within the non-NA values, it’s crucial to carefully consider how to handle this data. The presence of NA values raises questions, yet the data without NAs remains informative. It seems that the NAs might not be inherently informative, and straightforward imputation using the median or mean might not necessarily be the right answer.
Several theories could explain the occurrence of NAs:
-Data Collection Issues: Problems in the data collection or reporting process could lead to missing values.
-Applicability: This type of credit inquiry might not have been applicable or required for some applicants, resulting in fields being left blank or marked as NaN.
-Technical Errors: Issues in data processing, transfer, or storage could result in missing values.
Observing the data reveals 41,519 NAs across these columns, suggesting a substantial amount of missing information. However, the uniformity in the number of NAs across different columns suggests it might not be coincidental, indicating it could be the same individuals who either did not respond or were unable to respond. This pattern hints at potential issues with permission or applicability, where inquiries were possibly made selectively based on specific criteria. Given the considerable number of responses and the apparent non-randomness of the NAs, it may be prudent to exclude these missing values from certain analyses. Imputing these missing values could introduce bias, especially if the NAs result from systematic differences in the applicability of inquiries or permission granted by the applicants.
There is a difference in the descriptive stat colums, such as the ones that include median, mode, and average. With these stats, there doesnt seem to be a connection in terms of numbers of NA’s like the previous category. For these they will take more thought to what to do than to remove them from the dataset.
Lets chart them and see what happens.
-AVG:
# Select collumns
avg <- application_train %>%
select(APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG,
YEARS_BUILD_AVG, COMMONAREA_AVG, ELEVATORS_AVG, ENTRANCES_AVG,
FLOORSMAX_AVG, FLOORSMIN_AVG, LANDAREA_AVG)
# Reshape for faceting
data_long <- pivot_longer(avg,
cols = everything(),
names_to = "Variable",
values_to = "Value")
# Plot the distributions
p <- ggplot(data_long, aes(x = Value)) +
geom_histogram(bins = 30, na.rm = TRUE) +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(x = "Value", y = "Count") +
ggtitle("Distribution of Average Variables")
# Print the plot
print(p)
Seems like most of these distributions have a skew. With the skew, might
make sense to impute a median because mean would be sensitive to the
larger values. I think because there is no clear understanding of why
these are NA’s it makes sense to impute rather than blindly remove them
from the dataset.
-Median:
# Select columns
medi <- application_train %>%
select(APARTMENTS_MEDI, BASEMENTAREA_MEDI, YEARS_BEGINEXPLUATATION_MEDI,
YEARS_BUILD_MEDI, COMMONAREA_MEDI, ELEVATORS_MEDI, ENTRANCES_MEDI,
FLOORSMAX_MEDI, FLOORSMIN_MEDI, LANDAREA_MEDI,
LIVINGAPARTMENTS_MEDI, LIVINGAREA_MEDI, NONLIVINGAPARTMENTS_MEDI,
NONLIVINGAREA_MEDI)
# Reshape for faceting
data_long_medi <- pivot_longer(medi,
cols = everything(),
names_to = "Variable",
values_to = "Value")
# Plot distributions
p_medi <- ggplot(data_long_medi, aes(x = Value)) +
geom_histogram(bins = 30, na.rm = TRUE) +
facet_wrap(~ Variable, scales = "free_x", ncol = 3) +
theme_minimal() +
labs(x = "Value", y = "Count") +
ggtitle("Distribution of Median Variables")
# Print the plot
print(p_medi)
Same thing with the medians minus maybe years_build_medi.
-Mode
character_columns <- names(application_train)[sapply(application_train, is.character)]
print(character_columns)
## [1] "NAME_CONTRACT_TYPE" "CODE_GENDER"
## [3] "FLAG_OWN_CAR" "FLAG_OWN_REALTY"
## [5] "NAME_TYPE_SUITE" "NAME_INCOME_TYPE"
## [7] "NAME_EDUCATION_TYPE" "NAME_FAMILY_STATUS"
## [9] "NAME_HOUSING_TYPE" "OCCUPATION_TYPE"
## [11] "WEEKDAY_APPR_PROCESS_START" "ORGANIZATION_TYPE"
## [13] "FONDKAPREMONT_MODE" "HOUSETYPE_MODE"
## [15] "WALLSMATERIAL_MODE" "EMERGENCYSTATE_MODE"
head(application_train$WALLSMATERIAL_MODE)
## [1] "Stone, brick" "Block" "" "" ""
## [6] ""
# Select columns
mode <- application_train %>%
select(APARTMENTS_MODE, BASEMENTAREA_MODE, YEARS_BEGINEXPLUATATION_MODE,
YEARS_BUILD_MODE, COMMONAREA_MODE, ELEVATORS_MODE, ENTRANCES_MODE,
FLOORSMAX_MODE, FLOORSMIN_MODE, LANDAREA_MODE,
LIVINGAPARTMENTS_MODE, LIVINGAREA_MODE, NONLIVINGAPARTMENTS_MODE,
NONLIVINGAREA_MODE, TOTALAREA_MODE)
# Reshape for faceting
data_long_mode <- pivot_longer(mode,
cols = everything(),
names_to = "Variable",
values_to = "Value")
# Plot the distributions
p_mode <- ggplot(data_long_mode, aes(x = Value)) +
geom_histogram(bins = 30, na.rm = TRUE) +
facet_wrap(~ Variable, scales = "free_x", ncol = 3) +
theme_minimal() +
labs(x = "Value", y = "Count") +
ggtitle("Distribution of Mode Variables")
# Print the plot
print(p_mode)
So minus the columns that have character variables which I believe I am not going to use, these are once again the same concept. I am thinking median over mean for imputation.
First, lets graph it.
# give percentages
percentage_data <- application_train %>%
count(TARGET) %>% #count of occurances
mutate(Percentage = n / sum(n) * 100) # percentage
# bar chart
percentage_data %>%
ggplot(aes(x = as.factor(TARGET), y = Percentage, fill = as.factor(TARGET))) +
geom_bar(stat = "identity") +
labs(x = "Target", y = "Percentage", fill = "Target") +
scale_y_continuous(labels = scales::percent_format()) +
geom_text(aes(label = sprintf("%.2f%%", Percentage)), vjust = -0.5, size = 3.5) +
theme_minimal() +
theme(legend.position = "none")
1 are people with payment issues including late payments for x amount of days. 0 means all other cases. This shows that the majority of people actually applying have pretty good financial backing. This is an imbalance in the target variable, which is good for the customers, but not maybe perfect for the model. The imbalance suggests that any predictive models built to predict this target will need to be carefully calibrated to handle this imbalance. Otherwise, the model might be biased towards predicting the majority class. We might need to oversample the minority 1 class and undersample the majority 0 variable to make up for it. This is not all bad though, the clear differentiation between the two classes suggests that the target variable is well-defined and likely to be a reliable label for modeling purposes.
Are there demographic factors (age, gender, income level, etc.) that are associated with a higher risk of default?
# age days to years convert
application_train$AGE_YEARS <- -application_train$DAYS_BIRTH / 365
# Plot age vs target
ggplot(application_train, aes(x = AGE_YEARS, fill = as.factor(TARGET))) +
geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
labs(x = "Age (years)", y = "Count", fill = "Target") +
ggtitle("Age Distribution by Target") +
theme_minimal()
The histogram indicates that the majority of loan applicants are in
their late 20s to early 40s. There is a noticeable decline in the number
of applicants as age increases, particularly beyond 50 years. The
overlay of the two histograms shows that defaults (Target 1) occur
across all ages but are more concentrated in the younger age brackets.
The peak for defaults appears to be among the younger applicants, with
the proportion of defaults gradually decreasing as age increases.
# plot the income level against the target
ggplot(application_train, aes(x = AMT_INCOME_TOTAL, fill = as.factor(TARGET))) +
geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
labs(x = "Total Income", y = "Count", fill = "Target") +
ggtitle("Income Distribution by Target") +
theme_minimal()
This is really being affected by large outliers. There is a grouping that is making this chart unreadable and tells us nothing. For the sake of visualization and getting a deeper understanding, lets try to do it with a logarithmic scale.
ggplot(application_train, aes(x = log1p(AMT_INCOME_TOTAL), fill = as.factor(TARGET))) +
geom_histogram(position = 'identity', alpha = 0.5, bins = 100) +
scale_x_continuous(labels = scales::comma) +
labs(x = "Log(Total Income)", y = "Count", fill = "Target") +
ggtitle("Income Distribution by Target (Log Scale)") +
theme_minimal()
This looks much easier to read. Clearly, there is a rightward skew hinting that the majority of the applicants with higher risk are on the lower end of the economic spectrum. This would conceptually make sense as wealthier people probably would have at least enough line of credit to go through a standard loaning agency and certainly have the money to pay back their loans.
Throughout this exploratory data analysis, we should be able to create a decent model for candidates with lower credit histories. The investigation revealed key insights, particularly in demographic factors such as age and income, and their relationship with loan repayment difficulties. There is a clearly observed higher risk of default among younger applicants and identified income as a critical factor, with lower-income applicants more likely to default. These findings align with the business objective of Home Credit to empower financially underrepresented individuals while mitigating loan default risks. The pronounced class imbalance in the target variable has highlighted the necessity for advanced modeling techniques to ensure fair and accurate predictions. Moving forward, I recommend employing strategies like re-sampling or model re-weighting to address the imbalance issue. Additionally, further analysis could incorporate external economic factors to refine the predictions and support Home Credit in achieving financial inclusion without compromising on the integrity of the loan process.