• Intro
  • Data Information
    • Handling Missing data
    • Impact of Imputation
  • Data Questions
    • Target Variable Information
    • Demographic Analysis:
  • Conclusion

Intro

  1. Business Problem

• Description: The business problem being addressed is how to credit-check and qualify a candidate for a loan when they have minimal or no credit history. Currently, there is no mechanism for people who haven’t had the time or opportunity to build credit through a credit card. Home Credit is trying to find a way to serve this underrepresented segment of the economy while also protecting itself from overly risky loans.

• Impact Analysis: The impact of this analysis is substantial. Identifying good candidates helps both the company and its customers. For the company, it makes lending to this population sustainable without absorbing excessive default risk; defaults are especially costly here because there is no straightforward way to recoup the investment. For customers, it means borrowing from a financially solvent lender, so they are not pushed toward higher interest rates and sub-standard creditors.

Data Information

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.2
application_test <- read.csv("application_test.csv")
application_train <- read.csv("application_train.csv")

skim(application_train)
Data summary
Name application_train
Number of rows 307511
Number of columns 122
_______________________
Column type frequency:
character 16
numeric 106
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NAME_CONTRACT_TYPE 0 1 10 15 0 2 0
CODE_GENDER 0 1 1 3 0 3 0
FLAG_OWN_CAR 0 1 1 1 0 2 0
FLAG_OWN_REALTY 0 1 1 1 0 2 0
NAME_TYPE_SUITE 0 1 0 15 1292 8 0
NAME_INCOME_TYPE 0 1 7 20 0 8 0
NAME_EDUCATION_TYPE 0 1 15 29 0 5 0
NAME_FAMILY_STATUS 0 1 5 20 0 6 0
NAME_HOUSING_TYPE 0 1 12 19 0 6 0
OCCUPATION_TYPE 0 1 0 21 96391 19 0
WEEKDAY_APPR_PROCESS_START 0 1 6 9 0 7 0
ORGANIZATION_TYPE 0 1 3 22 0 58 0
FONDKAPREMONT_MODE 0 1 0 21 210295 5 0
HOUSETYPE_MODE 0 1 0 16 154297 4 0
WALLSMATERIAL_MODE 0 1 0 12 156341 8 0
EMERGENCYSTATE_MODE 0 1 0 3 145755 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SK_ID_CURR 0 1.00 278180.52 102790.18 100002.00 189145.50 278202.00 367142.50 456255.00 ▇▇▇▇▇
TARGET 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
CNT_CHILDREN 0 1.00 0.42 0.72 0.00 0.00 0.00 1.00 19.00 ▇▁▁▁▁
AMT_INCOME_TOTAL 0 1.00 168797.92 237123.15 25650.00 112500.00 147150.00 202500.00 117000000.00 ▇▁▁▁▁
AMT_CREDIT 0 1.00 599026.00 402490.78 45000.00 270000.00 513531.00 808650.00 4050000.00 ▇▂▁▁▁
AMT_ANNUITY 12 1.00 27108.57 14493.74 1615.50 16524.00 24903.00 34596.00 258025.50 ▇▁▁▁▁
AMT_GOODS_PRICE 278 1.00 538396.21 369446.46 40500.00 238500.00 450000.00 679500.00 4050000.00 ▇▂▁▁▁
REGION_POPULATION_RELATIVE 0 1.00 0.02 0.01 0.00 0.01 0.02 0.03 0.07 ▇▇▃▁▁
DAYS_BIRTH 0 1.00 -16037.00 4363.99 -25229.00 -19682.00 -15750.00 -12413.00 -7489.00 ▃▆▇▇▅
DAYS_EMPLOYED 0 1.00 63815.05 141275.77 -17912.00 -2760.00 -1213.00 -289.00 365243.00 ▇▁▁▁▂
DAYS_REGISTRATION 0 1.00 -4986.12 3522.89 -24672.00 -7479.50 -4504.00 -2010.00 0.00 ▁▁▂▅▇
DAYS_ID_PUBLISH 0 1.00 -2994.20 1509.45 -7197.00 -4299.00 -3254.00 -1720.00 0.00 ▁▆▇▆▅
OWN_CAR_AGE 202929 0.34 12.06 11.94 0.00 5.00 9.00 15.00 91.00 ▇▁▁▁▁
FLAG_MOBIL 0 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
FLAG_EMP_PHONE 0 1.00 0.82 0.38 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
FLAG_WORK_PHONE 0 1.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
FLAG_CONT_MOBILE 0 1.00 1.00 0.04 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
FLAG_PHONE 0 1.00 0.28 0.45 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃
FLAG_EMAIL 0 1.00 0.06 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
CNT_FAM_MEMBERS 2 1.00 2.15 0.91 1.00 2.00 2.00 3.00 20.00 ▇▁▁▁▁
REGION_RATING_CLIENT 0 1.00 2.05 0.51 1.00 2.00 2.00 2.00 3.00 ▁▁▇▁▂
REGION_RATING_CLIENT_W_CITY 0 1.00 2.03 0.50 1.00 2.00 2.00 2.00 3.00 ▁▁▇▁▂
HOUR_APPR_PROCESS_START 0 1.00 12.06 3.27 0.00 10.00 12.00 14.00 23.00 ▁▃▇▆▁
REG_REGION_NOT_LIVE_REGION 0 1.00 0.02 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_REGION_NOT_WORK_REGION 0 1.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
LIVE_REGION_NOT_WORK_REGION 0 1.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_CITY_NOT_LIVE_CITY 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
REG_CITY_NOT_WORK_CITY 0 1.00 0.23 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
LIVE_CITY_NOT_WORK_CITY 0 1.00 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
EXT_SOURCE_1 173378 0.44 0.50 0.21 0.01 0.33 0.51 0.68 0.96 ▂▇▇▇▃
EXT_SOURCE_2 660 1.00 0.51 0.19 0.00 0.39 0.57 0.66 0.85 ▂▂▃▇▃
EXT_SOURCE_3 60965 0.80 0.51 0.19 0.00 0.37 0.54 0.67 0.90 ▂▃▆▇▃
APARTMENTS_AVG 156061 0.49 0.12 0.11 0.00 0.06 0.09 0.15 1.00 ▇▁▁▁▁
BASEMENTAREA_AVG 179943 0.41 0.09 0.08 0.00 0.04 0.08 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_AVG 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_AVG 204488 0.34 0.75 0.11 0.00 0.69 0.76 0.82 1.00 ▁▁▁▇▅
COMMONAREA_AVG 214865 0.30 0.04 0.08 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_AVG 163891 0.47 0.08 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_AVG 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▃▁▁▁
FLOORSMAX_AVG 153020 0.50 0.23 0.14 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_AVG 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_AVG 182590 0.41 0.07 0.08 0.00 0.02 0.05 0.09 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_AVG 210199 0.32 0.10 0.09 0.00 0.05 0.08 0.12 1.00 ▇▁▁▁▁
LIVINGAREA_AVG 154350 0.50 0.11 0.11 0.00 0.05 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_AVG 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_AVG 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.03 1.00 ▇▁▁▁▁
APARTMENTS_MODE 156061 0.49 0.11 0.11 0.00 0.05 0.08 0.14 1.00 ▇▁▁▁▁
BASEMENTAREA_MODE 179943 0.41 0.09 0.08 0.00 0.04 0.07 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_MODE 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_MODE 204488 0.34 0.76 0.11 0.00 0.70 0.76 0.82 1.00 ▁▁▁▇▅
COMMONAREA_MODE 214865 0.30 0.04 0.07 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_MODE 163891 0.47 0.07 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_MODE 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▂▁▁▁
FLOORSMAX_MODE 153020 0.50 0.22 0.14 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_MODE 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_MODE 182590 0.41 0.06 0.08 0.00 0.02 0.05 0.08 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_MODE 210199 0.32 0.11 0.10 0.00 0.05 0.08 0.13 1.00 ▇▁▁▁▁
LIVINGAREA_MODE 154350 0.50 0.11 0.11 0.00 0.04 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_MODE 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_MODE 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.02 1.00 ▇▁▁▁▁
APARTMENTS_MEDI 156061 0.49 0.12 0.11 0.00 0.06 0.09 0.15 1.00 ▇▁▁▁▁
BASEMENTAREA_MEDI 179943 0.41 0.09 0.08 0.00 0.04 0.08 0.11 1.00 ▇▁▁▁▁
YEARS_BEGINEXPLUATATION_MEDI 150007 0.51 0.98 0.06 0.00 0.98 0.98 0.99 1.00 ▁▁▁▁▇
YEARS_BUILD_MEDI 204488 0.34 0.76 0.11 0.00 0.69 0.76 0.83 1.00 ▁▁▁▇▅
COMMONAREA_MEDI 214865 0.30 0.04 0.08 0.00 0.01 0.02 0.05 1.00 ▇▁▁▁▁
ELEVATORS_MEDI 163891 0.47 0.08 0.13 0.00 0.00 0.00 0.12 1.00 ▇▁▁▁▁
ENTRANCES_MEDI 154828 0.50 0.15 0.10 0.00 0.07 0.14 0.21 1.00 ▇▃▁▁▁
FLOORSMAX_MEDI 153020 0.50 0.23 0.15 0.00 0.17 0.17 0.33 1.00 ▇▃▁▁▁
FLOORSMIN_MEDI 208642 0.32 0.23 0.16 0.00 0.08 0.21 0.38 1.00 ▅▇▁▁▁
LANDAREA_MEDI 182590 0.41 0.07 0.08 0.00 0.02 0.05 0.09 1.00 ▇▁▁▁▁
LIVINGAPARTMENTS_MEDI 210199 0.32 0.10 0.09 0.00 0.05 0.08 0.12 1.00 ▇▁▁▁▁
LIVINGAREA_MEDI 154350 0.50 0.11 0.11 0.00 0.05 0.07 0.13 1.00 ▇▁▁▁▁
NONLIVINGAPARTMENTS_MEDI 213514 0.31 0.01 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
NONLIVINGAREA_MEDI 169682 0.45 0.03 0.07 0.00 0.00 0.00 0.03 1.00 ▇▁▁▁▁
TOTALAREA_MODE 148431 0.52 0.10 0.11 0.00 0.04 0.07 0.13 1.00 ▇▁▁▁▁
OBS_30_CNT_SOCIAL_CIRCLE 1021 1.00 1.42 2.40 0.00 0.00 0.00 2.00 348.00 ▇▁▁▁▁
DEF_30_CNT_SOCIAL_CIRCLE 1021 1.00 0.14 0.45 0.00 0.00 0.00 0.00 34.00 ▇▁▁▁▁
OBS_60_CNT_SOCIAL_CIRCLE 1021 1.00 1.41 2.38 0.00 0.00 0.00 2.00 344.00 ▇▁▁▁▁
DEF_60_CNT_SOCIAL_CIRCLE 1021 1.00 0.10 0.36 0.00 0.00 0.00 0.00 24.00 ▇▁▁▁▁
DAYS_LAST_PHONE_CHANGE 1 1.00 -962.86 826.81 -4292.00 -1570.00 -757.00 -274.00 0.00 ▁▁▂▃▇
FLAG_DOCUMENT_2 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_3 0 1.00 0.71 0.45 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
FLAG_DOCUMENT_4 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_5 0 1.00 0.02 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_6 0 1.00 0.09 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_7 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_8 0 1.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_9 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_10 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_11 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_12 0 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_13 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_14 0 1.00 0.00 0.05 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_15 0 1.00 0.00 0.03 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_16 0 1.00 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_17 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_18 0 1.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_19 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_20 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
FLAG_DOCUMENT_21 0 1.00 0.00 0.02 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.86 0.01 0.08 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.86 0.01 0.11 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.86 0.03 0.20 0.00 0.00 0.00 0.00 8.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_MON 41519 0.86 0.27 0.92 0.00 0.00 0.00 0.00 27.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.86 0.27 0.79 0.00 0.00 0.00 0.00 261.00 ▇▁▁▁▁
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.86 1.90 1.87 0.00 0.00 1.00 3.00 25.00 ▇▁▁▁▁

Handling Missing data

Amount Required Credit Bureau with time distinction

I want to combine some similar sections of the data to see if there is a grouped way of handling the missing data.

# select the six AMT_REQ columns and preview the first 15 rows
amt_req <- application_train %>%
  select(AMT_REQ_CREDIT_BUREAU_DAY, AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_MON, 
         AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_YEAR) %>%
  head(15)

# list unique values
list_unique_values <- list(
  AMT_REQ_CREDIT_BUREAU_DAY = unique(application_train$AMT_REQ_CREDIT_BUREAU_DAY),
  AMT_REQ_CREDIT_BUREAU_HOUR = unique(application_train$AMT_REQ_CREDIT_BUREAU_HOUR),
  AMT_REQ_CREDIT_BUREAU_MON = unique(application_train$AMT_REQ_CREDIT_BUREAU_MON),
  AMT_REQ_CREDIT_BUREAU_QRT = unique(application_train$AMT_REQ_CREDIT_BUREAU_QRT),
  AMT_REQ_CREDIT_BUREAU_WEEK = unique(application_train$AMT_REQ_CREDIT_BUREAU_WEEK),
  AMT_REQ_CREDIT_BUREAU_YEAR = unique(application_train$AMT_REQ_CREDIT_BUREAU_YEAR)
)

list_unique_values
## $AMT_REQ_CREDIT_BUREAU_DAY
##  [1]  0 NA  1  3  2  4  5  6  9  8
## 
## $AMT_REQ_CREDIT_BUREAU_HOUR
## [1]  0 NA  1  2  3  4
## 
## $AMT_REQ_CREDIT_BUREAU_MON
##  [1]  0 NA  1  2  6  5  3  7  9  4 11  8 16 12 14 10 13 17 24 19 15 23 18 27 22
## 
## $AMT_REQ_CREDIT_BUREAU_QRT
##  [1]   0  NA   1   2   4   3   8   5   6   7 261  19
## 
## $AMT_REQ_CREDIT_BUREAU_WEEK
##  [1]  0 NA  1  3  2  4  5  6  8  7
## 
## $AMT_REQ_CREDIT_BUREAU_YEAR
##  [1]  1  0 NA  2  4  5  3  8  6  9  7 10 11 13 16 12 25 23 15 14 22 17 19 18 21
## [26] 20

This column is described as the “Number of inquiries to the Credit Bureau about the client one {timeframe} before the application.” Given the significant number of NA values, alongside the valuable information contained within the non-NA values, it’s crucial to carefully consider how to handle this data. The presence of NA values raises questions, yet the data without NAs remains informative. It seems that the NAs might not be inherently informative, and straightforward imputation using the median or mean might not necessarily be the right answer.

Several theories could explain the occurrence of NAs:

  • Lack of Permission or Access: The applicant may not have authorized the lender to make these inquiries, or the lender might not have initiated an inquiry during that period.

  • Data Collection Issues: Problems in the data collection or reporting process could lead to missing values.

  • Applicability: This type of credit inquiry might not have been applicable or required for some applicants, resulting in fields being left blank or marked as NA.

  • Technical Errors: Issues in data processing, transfer, or storage could result in missing values.

Observing the data reveals 41,519 NAs across these columns, suggesting a substantial amount of missing information. However, the uniformity in the number of NAs across different columns suggests it might not be coincidental, indicating it could be the same individuals who either did not respond or were unable to respond. This pattern hints at potential issues with permission or applicability, where inquiries were possibly made selectively based on specific criteria. Given the considerable number of responses and the apparent non-randomness of the NAs, it may be prudent to exclude these missing values from certain analyses. Imputing these missing values could introduce bias, especially if the NAs result from systematic differences in the applicability of inquiries or permission granted by the applicants.
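That "same individuals" theory can be checked directly: if the 41,519 NAs belong to the same applicants, then the rows with any missing AMT_REQ value should be exactly the rows where all six are missing. A quick sketch, assuming `application_train` is loaded as above:

```r
# columns sharing the AMT_REQ_CREDIT_BUREAU_* prefix
amt_req_cols <- application_train %>%
  select(starts_with("AMT_REQ_CREDIT_BUREAU"))

# rows where every one of the six columns is NA at once
n_all_na <- sum(rowSums(is.na(amt_req_cols)) == ncol(amt_req_cols))
# rows where at least one of the six columns is NA
n_any_na <- sum(rowSums(is.na(amt_req_cols)) > 0)

# if the NAs come from the same applicants, both counts should equal 41,519
c(all_na = n_all_na, any_na = n_any_na)
```

If the two counts match, the missingness is row-level rather than scattered, which supports excluding those rows rather than imputing.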

Impact of Imputation

Descriptive Stat Columns

There is a difference in the descriptive-stat columns, such as those reporting the median, mode, and average. With these stats, there doesn’t seem to be a shared NA count connecting them like the previous category. These will take more thought than simply removing them from the dataset.

Let’s chart them and see what happens.

-AVG:

# Select columns 
avg <- application_train %>%
  select(APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG,
         YEARS_BUILD_AVG, COMMONAREA_AVG, ELEVATORS_AVG, ENTRANCES_AVG,
         FLOORSMAX_AVG, FLOORSMIN_AVG, LANDAREA_AVG)

# Reshape for faceting
data_long <- pivot_longer(avg, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot the distributions 
p <- ggplot(data_long, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x") +
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Average Variables")

# Print the plot
print(p)

Most of these distributions are skewed. Given the skew, imputing the median makes more sense than the mean, since the mean is sensitive to the larger values. And because there is no clear explanation for why these values are NA, it makes sense to impute rather than blindly remove them from the dataset.
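A minimal sketch of that median imputation for the _AVG columns (using the tidyverse `mutate(across(...))` idiom; this assumes `application_train` is loaded as above):

```r
# replace each *_AVG column's NAs with that column's own median
avg_imputed <- application_train %>%
  mutate(across(ends_with("_AVG"),
                ~ if_else(is.na(.x), median(.x, na.rm = TRUE), .x)))

# sanity check: no NAs should remain in the imputed columns
sum(is.na(select(avg_imputed, ends_with("_AVG"))))
```

The same pattern would apply to the _MEDI and numeric _MODE columns with a different suffix in `ends_with()`.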

-Median:

# Select columns 
medi <- application_train %>%
  select(APARTMENTS_MEDI, BASEMENTAREA_MEDI, YEARS_BEGINEXPLUATATION_MEDI,
         YEARS_BUILD_MEDI, COMMONAREA_MEDI, ELEVATORS_MEDI, ENTRANCES_MEDI,
         FLOORSMAX_MEDI, FLOORSMIN_MEDI, LANDAREA_MEDI,
         LIVINGAPARTMENTS_MEDI, LIVINGAREA_MEDI, NONLIVINGAPARTMENTS_MEDI,
         NONLIVINGAREA_MEDI)

# Reshape for faceting
data_long_medi <- pivot_longer(medi, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot distributions 
p_medi <- ggplot(data_long_medi, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) + 
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Median Variables")

# Print the plot
print(p_medi)

The medians tell the same story, with the possible exception of YEARS_BUILD_MEDI.

-Mode

character_columns <- names(application_train)[sapply(application_train, is.character)]
print(character_columns)
##  [1] "NAME_CONTRACT_TYPE"         "CODE_GENDER"               
##  [3] "FLAG_OWN_CAR"               "FLAG_OWN_REALTY"           
##  [5] "NAME_TYPE_SUITE"            "NAME_INCOME_TYPE"          
##  [7] "NAME_EDUCATION_TYPE"        "NAME_FAMILY_STATUS"        
##  [9] "NAME_HOUSING_TYPE"          "OCCUPATION_TYPE"           
## [11] "WEEKDAY_APPR_PROCESS_START" "ORGANIZATION_TYPE"         
## [13] "FONDKAPREMONT_MODE"         "HOUSETYPE_MODE"            
## [15] "WALLSMATERIAL_MODE"         "EMERGENCYSTATE_MODE"
head(application_train$WALLSMATERIAL_MODE)
## [1] "Stone, brick" "Block"        ""             ""             ""            
## [6] ""
# Select columns 
mode <- application_train %>%
  select(APARTMENTS_MODE, BASEMENTAREA_MODE, YEARS_BEGINEXPLUATATION_MODE,
         YEARS_BUILD_MODE, COMMONAREA_MODE, ELEVATORS_MODE, ENTRANCES_MODE,
         FLOORSMAX_MODE, FLOORSMIN_MODE, LANDAREA_MODE,
         LIVINGAPARTMENTS_MODE, LIVINGAREA_MODE, NONLIVINGAPARTMENTS_MODE,
         NONLIVINGAREA_MODE, TOTALAREA_MODE)

# Reshape for faceting
data_long_mode <- pivot_longer(mode, 
                          cols = everything(), 
                          names_to = "Variable", 
                          values_to = "Value")

# Plot the distributions 
p_mode <- ggplot(data_long_mode, aes(x = Value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 3) + 
  theme_minimal() +
  labs(x = "Value", y = "Count") +
  ggtitle("Distribution of Mode Variables")

# Print the plot
print(p_mode)

Setting aside the character-typed _MODE columns, which I do not expect to use, these follow the same pattern once again. I am leaning toward median over mean for imputation.

Data Questions

Target Variable Information

  1. Let’s take a look at the TARGET variable and see its distribution

First, let’s graph it.

# compute percentages
percentage_data <- application_train %>%
  count(TARGET) %>% # count of occurrences 
  mutate(Percentage = n / sum(n) * 100) # percentage

# bar chart
percentage_data %>%
  ggplot(aes(x = as.factor(TARGET), y = Percentage, fill = as.factor(TARGET))) +
  geom_bar(stat = "identity") +
  labs(x = "Target", y = "Percentage", fill = "Target") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # values are already 0-100
  geom_text(aes(label = sprintf("%.2f%%", Percentage)), vjust = -0.5, size = 3.5) +
  theme_minimal() +
  theme(legend.position = "none") 

TARGET = 1 marks clients with payment difficulties, including late payments beyond a threshold number of days; 0 covers all other cases. This shows that the majority of applicants actually have reasonably sound financial backing. There is an imbalance in the target variable, which is good for customers and the company alike, but not ideal for the model. The imbalance means any predictive model built on this target will need to be carefully calibrated; otherwise, it will be biased toward predicting the majority class. We might need to oversample the minority class (1) or undersample the majority class (0) to compensate. This is not all bad, though: the clear differentiation between the two classes suggests the target variable is well-defined and likely to be a reliable label for modeling purposes.
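As one rough illustration of the undersampling idea (a sketch only; in practice a package such as themis or ROSE would handle this more carefully), the majority class can be downsampled to the minority class size with plain dplyr:

```r
set.seed(42) # make the random sample reproducible

minority <- application_train %>% filter(TARGET == 1)
majority <- application_train %>% filter(TARGET == 0)

# downsample the majority class to match the minority class size
balanced <- bind_rows(
  minority,
  slice_sample(majority, n = nrow(minority))
)

table(balanced$TARGET) # both classes now have equal counts
```

Oversampling the minority class, or weighting the loss in the model itself, are the alternatives when discarding majority-class rows is too wasteful.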

Demographic Analysis:

Are there demographic factors (age, gender, income level, etc.) that are associated with a higher risk of default?

Age

# convert DAYS_BIRTH (days before application, negative) to age in years
application_train$AGE_YEARS <- -application_train$DAYS_BIRTH / 365

# Plot age vs target
ggplot(application_train, aes(x = AGE_YEARS, fill = as.factor(TARGET))) +
  geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
  labs(x = "Age (years)", y = "Count", fill = "Target") +
  ggtitle("Age Distribution by Target") +
  theme_minimal()

The histogram indicates that the majority of loan applicants are in their late 20s to early 40s. There is a noticeable decline in the number of applicants as age increases, particularly beyond 50 years. The overlay of the two histograms shows that defaults (Target 1) occur across all ages but are more concentrated in the younger age brackets. The peak for defaults appears to be among the younger applicants, with the proportion of defaults gradually decreasing as age increases.
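To quantify that impression rather than eyeball it (a quick check, assuming the AGE_YEARS column created above), we can compute the default rate within ten-year age bins:

```r
# default rate per 10-year age bucket
application_train %>%
  mutate(age_bin = cut(AGE_YEARS, breaks = seq(20, 70, by = 10))) %>%
  group_by(age_bin) %>%
  summarise(n = n(), default_rate = mean(TARGET)) %>%
  arrange(age_bin)
```

If the histogram reading is right, default_rate should decline monotonically across the buckets.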

Income Level

# plot the income level against the target 
ggplot(application_train, aes(x = AMT_INCOME_TOTAL, fill = as.factor(TARGET))) +
  geom_histogram(bins = 50, position = 'identity', alpha = 0.5) +
  labs(x = "Total Income", y = "Count", fill = "Target") +
  ggtitle("Income Distribution by Target") +
  theme_minimal()

This is really being affected by large outliers. A cluster of extreme values makes the chart unreadable and tells us nothing. For the sake of visualization and a deeper understanding, let’s try a logarithmic scale.

ggplot(application_train, aes(x = log1p(AMT_INCOME_TOTAL), fill = as.factor(TARGET))) +
  geom_histogram(position = 'identity', alpha = 0.5, bins = 100) +
  labs(x = "log1p(Total Income)", y = "Count", fill = "Target") +
  ggtitle("Income Distribution by Target (Log Scale)") +
  theme_minimal()

This looks much easier to read. The defaults appear concentrated toward the lower end of the income spectrum. That makes conceptual sense: wealthier applicants are more likely to have at least enough credit history to go through a standard lender, and they certainly have the means to repay their loans.
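To back that reading with numbers (a sketch, again assuming `application_train` as loaded above), the default rate can be compared across income quintiles:

```r
# default rate by income quintile (1 = lowest income, 5 = highest)
application_train %>%
  mutate(income_quintile = ntile(AMT_INCOME_TOTAL, 5)) %>%
  group_by(income_quintile) %>%
  summarise(n = n(), default_rate = mean(TARGET))
```

A higher default_rate in quintile 1 than quintile 5 would confirm the visual impression from the log-scale histogram.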

Conclusion

This exploratory data analysis suggests we should be able to build a decent model for candidates with limited credit histories. The investigation revealed key insights, particularly in demographic factors such as age and income and their relationship with loan repayment difficulties. There is a clearly higher observed risk of default among younger applicants, and income emerged as a critical factor, with lower-income applicants more likely to default. These findings align with Home Credit’s business objective of empowering financially underrepresented individuals while mitigating loan default risk. The pronounced class imbalance in the target variable highlights the need for careful modeling techniques to ensure fair and accurate predictions. Moving forward, I recommend strategies such as re-sampling or model re-weighting to address the imbalance. Further analysis could also incorporate external economic factors to refine predictions and support Home Credit in achieving financial inclusion without compromising the integrity of the loan process.