Predicting Loan Defaults Using Logistic Regression
What characteristics influence the probability that a loan will be defaulted?
Introduction
Traditionally, lending has been founded on trust. Credit reports existed before 1989, but until the FICO Score was created that year (according to myFICO.com), the money-lending process was fairly subjective, and potential borrowers were often judged by how trustworthy their character seemed. Today, lenders are able to use tools like FICO Scores to quantify how trustworthy potential borrowers are, which reduces that subjectivity. All of this is done for one purpose: to determine how likely it is that a given borrower will default on a loan.
Predicting default rates is a significant part of money-lending because lenders must predict whether giving out a loan will result in profit or loss. Normally, loans are profitable because of interest, but sometimes a borrower will default, which is both a betrayal of the moneylender’s trust and a hazard to the moneylender’s business. Thus, it is important that the lender is able to gauge the likelihood of a borrower defaulting before making a loan to him/her.
Given the large number of factors that might affect borrower default rates, it may be infeasible to come up with good estimates heuristically or by hand. The goal of this project is to explore whether we can employ statistical and machine learning models to better predict the risk of borrower default. By analyzing variables that describe loans and the financial situations of their borrowers, we will look for key relationships between default rates, loan characteristics, and borrower behaviors.
Data Description
For this project, we will use anonymized data from a lending company. The data contains historical information on details of the loan itself and characteristics of the borrower. Some feature names are also anonymized to protect sensitive information. Of the variables in the original data file, we will target the following variables as points of interest:
Default: This binary variable represents whether or not the borrower defaulted on the loan. Default rates will be the focus of this project because we want to analyze how they could be related to other variables. The data set contains 1000 loans that were defaulted on and 2000 that were not. In reality, only around 7% of loans were defaulted on, but we upsample this group to better extract signals on what might lead to loan default.
Reason: This categorical variable represents the reason the loan was taken out. Reasons for taking out a loan have been coded as the following: for the purchase of a boat, for a business, for credit cards, for an event, for a holiday, for the purchase of a home, for medical bills, for home relocation, for home renovation, for the installation of solar panels, for transport, and for other reasons.
Amount: This continuous variable represents the amount of money that was taken out as the loan.
Annual Income: This continuous variable represents the amount of money that the borrower earned last year.
Interest: This continuous variable represents the interest rate charged on the loan, expressed as a percentage.
Term: This variable represents the length of time the loan lasts. In this data set, loan terms are either 3 or 5 years.
Employment: This variable represents the length of time the borrower has been employed. In this data set, this variable is categorical, with levels running from < 1 year and 1 year up to 10+ years.
Credit Balance: This continuous variable represents the amount of money that the borrower spent on credit last year.
Credit Ratio: This continuous variable is the proportion of the borrower's credit line that has been used. Values are expressed as percentages, so the ratio is multiplied by 100. Although the credit used should not surpass the credit line, a few borrowers have credit ratios greater than 100.
v5 and v6 are anonymized continuous variables.
Exploratory Data Analysis
Independent Variables
We will first look at the distributions of, and relationships between, some characteristics of the loan or its borrower. This will help us determine which predictor variables may have interesting patterns and where we should be concerned about multicollinearity, which arises when predictor variables are so strongly correlated with each other that the model's coefficient estimates become unstable and hard to interpret.
It is interesting to note that the values of interest rate and loan amount have varied frequency, with some values occurring over 30 times in the data set and others occurring only once (Figure 1 & 2). This is probably because some interest rates and amounts are more popular as parts of standard loan packages.
It is clear in Figure 3 that debt and credit cards are the most common reasons that borrowers take out loans. This is probably because people take out loans to pay off pre-existing debts or to pay off credit card bills. It is important to note that there are 6 categories with sample sizes of less than 30, so if we intend to use reasons in our model, we should be cautious about them due to high variability.
Figure 4 shows that there are far more short-term loans than long-term loans. Both loan terms have enough entries that sample size is not a concern.
The relationship between credit ratio and credit balance (Figure 5) is positive and linear but not very strong. The positive direction makes sense intuitively because people who spend more on credit are also likely to be closer to maxing out their credit limits, and thus have a higher credit ratio. Because the relationship is weak, we can include both variables in the model without worrying about multicollinearity. In Figure 5, the points in red are visual outliers, where the credit balance is more than 9000 dollars greater than 1200 times the credit ratio.
From Figure 6, it appears that credit balance is not significantly affected by how long the borrower was employed. It is interesting to note that credit balance is slightly higher for borrowers who have been employed for at least 10 years, which makes sense, as people with more consistent incomes have more spending freedom.
While v6 appears to be approximately normally distributed (Figure 7), the distribution of v5 is strongly skewed to the right. There does not appear to be a relationship between the two variables, so we can use both of them in a model without worrying about multicollinearity. The points in red are visual outliers.
There is a positive linear relationship between loan amount and v5 (Figure 8), but it is relatively weak. This may be because v5 is a variable that depends on or is related to the loan amount. The points in red are visual outliers, where v5 is over 500 units greater than 0.35 times the loan amount.
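The two visual-outlier rules quoted for Figures 5 and 8 have the same shape: a point is flagged when its y-value lies more than a fixed offset above a line through the origin. As a minimal Python sketch of that rule (the function names and the sample points are hypothetical; only the slopes and offsets come from the text above):

```python
def is_visual_outlier(x, y, slope, offset):
    """Return True when y lies more than `offset` above the line y = slope * x."""
    return y > slope * x + offset


def balance_outlier(credit_ratio, credit_balance):
    # Figure 5 rule: balance more than 9000 dollars above 1200 * credit ratio.
    return is_visual_outlier(credit_ratio, credit_balance, slope=1200, offset=9000)


def v5_outlier(amount, v5):
    # Figure 8 rule: v5 more than 500 units above 0.35 * loan amount.
    return is_visual_outlier(amount, v5, slope=0.35, offset=500)


# Made-up points, for illustration only (not taken from the data set).
print(balance_outlier(credit_ratio=50, credit_balance=80000))  # threshold is 69000
print(balance_outlier(credit_ratio=50, credit_balance=60000))
print(v5_outlier(amount=10000, v5=4200))                       # threshold is 4000
```

Writing the rule as a function makes it easy to count how many loans each threshold flags before deciding whether to keep them in a model.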
Relationship to Defaults
The following bar graphs explore the correlations between some loan/borrower characteristics and whether or not the loan was defaulted on. The following characteristics seem to have the most influence on default rates.
Looking at Figure 9, we observe that as interest rate increases, so does the average default rate. This makes sense, because a higher interest rate means the loan is harder to pay back. It is also worth noting that the 30 loans with interest rates from 3% to 6% do not appear on the graph because none of them were defaulted on.
Average default rates generally decrease as credit balance increases. This may be because credit balance is correlated with socioeconomic status, so high spenders are also more capable of paying off loans. A brief look at loans for borrowers with credit balances above 70k shows that default rates continue to decrease as credit balance increases, although the sample sizes are small so we need to be careful about our evaluations.
Between credit balances of 55k and 70k (Figure 10), average default rates get very high. This may be because people overspend and cannot pay back their loans. However, the sample sizes in that range are also very small, so further research would be required to make a conclusion.
Overall, default rates appear to slowly increase as credit ratio increases (Figure 11). When we compare low credit ratio loans with high credit ratio loans in Figure 12, it is clear that borrowers with low credit ratios tend to default less. This may be because people who are cautious about spending are more responsible about loans. One thing to note is that borrowers within our data set with credit ratios above 110 always default. A borrower with a credit ratio above 100 has overcharged his/her credit card, so it makes sense for the borrower to be equally irresponsible with loans or less able to pay back loans due to other outstanding debts. However, there are few samples in this category, so we must be careful not to overfit the model.
Default rates appear to be higher in the middle range of the anonymized variable v5 (Figure 13). Also, the default rate for loans with v5 between 900 and 1000 seems to be out-of-place. This may have something to do with what the variable represents, but it could also be because the sample size is small.
Generally, there is a slight downward trend in average default rates as annual income increases (Figure 14). However, defaults seem to spike for borrowers with annual incomes of around $80,000. It is unclear why this is.
For employment, it appears that default rates are highest for the recently employed and surprisingly also those who have already been employed for a while. The higher default rates in later years may be because people take out loans when they start a new job — perhaps upon graduation or when moving to a new city — and the loans are not due until 3 or 5 years later.
Methods
We want to focus on the impact of different loan/borrower characteristics on the probability of default. Since default is a binary variable (loans are either defaulted on or not), we will use logistic regression to build a model. The formula for logistic regression is
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
where p is the probability that the target variable is 1 (loan defaulted), and the variables x_1, ..., x_k on the right side are predictor variables. Each continuous predictor variable contributes one independent variable to the equation, while categorical variables are slightly more complicated. For example, given a variable with four categories, one category becomes the base, while the other three contribute three binary, mutually exclusive independent variables.
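To make the two ideas above concrete, here is a minimal Python sketch of how a linear combination of predictors is turned into a probability, and how a categorical variable expands into indicator variables with one base category. The function names and the four-category example are hypothetical and are not taken from the project's code.

```python
import math


def log_odds(intercept, coefs, x):
    """Linear predictor: intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(coefs[name] * x[name] for name in coefs)


def probability(z):
    """Invert the logit: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def dummy_code(value, categories):
    """Encode a category as len(categories) - 1 indicators; categories[0] is the base."""
    return {f"is_{c}": int(value == c) for c in categories[1:]}


levels = ["a", "b", "c", "d"]    # hypothetical four-category variable, "a" is the base
print(dummy_code("c", levels))   # {'is_b': 0, 'is_c': 1, 'is_d': 0}
print(dummy_code("a", levels))   # {'is_b': 0, 'is_c': 0, 'is_d': 0}
print(probability(0.0))          # 0.5
```

A log odds of 0 corresponds to a probability of exactly 0.5, and the base category is represented by all indicators being 0, which is why its effect is absorbed into the intercept.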
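To evaluate the accuracy of these logistic regression models, we will analyze AUC, AIC, predicted accuracy, and weighted accuracy. AUC measures the area under the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds; a model that more consistently ranks defaulted loans above non-defaulted loans has a higher AUC. The Akaike information criterion (AIC) estimates how much information a model loses relative to the true process that generated the data, while penalizing extra parameters, so a lower AIC suggests a better model. The basic formula for AIC is
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
where k is the number of estimated parameters and \hat{L} is the maximized value of the model's likelihood.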
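As a quick sanity check on how AIC is computed, the sketch below applies the standard definition (twice the number of parameters minus twice the maximized log-likelihood) to the Model 4 summary reported in the Results. It assumes the model was fit to individual 0/1 outcomes, in which case the residual deviance equals −2 times the log-likelihood; the function name is hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Values from the Model 4 summary in the Results section.
residual_deviance = 3297.6
n_params = 6  # intercept plus five coefficients

# Assumption: ungrouped binary outcomes, so deviance = -2 * log-likelihood.
log_likelihood = -residual_deviance / 2

print(round(aic(log_likelihood, n_params), 1))  # 3309.6
```

The result matches the AIC printed in the summary, which shows that the penalty for Model 4 is just 2 × 6 = 12 on top of the residual deviance.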
We will also compare predicted accuracy by calculating the proportion of loans that were correctly predicted as defaulted or not defaulted. However, the data set does not reflect the real-world distribution of defaulted loans: approximately 33% of the loans in the data set were defaulted on, compared with approximately 7% in reality. Weighted accuracy compensates for this imbalance by putting more value on defaulted loans that are predicted accurately. The formula for calculating weighted accuracy is as follows:
We will also cross-validate our models to check that they generalize to loan data they were not trained on. A train-test split at an 80:20 ratio gives each model enough data to train on while still leaving some to test on.
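For readers who want to see what an 80:20 split looks like in practice, here is a minimal Python sketch using only the standard library. The function name, the fixed seed, and the stand-in list of 3000 loan IDs are assumptions for illustration; the original analysis may have split the data differently.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


loans = list(range(3000))       # stand-in loan IDs, not the real data
train, test = train_test_split(loans)
print(len(train), len(test))    # 2400 600
```

Shuffling before splitting matters here because the data set may be ordered (for example, defaulted loans first), and an unshuffled split would then give the test set a different default rate from the training set.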
To evaluate how effective our models really are, we will compare the models built with a null, or “coin toss,” model. This model randomly predicts defaults for loans based on the proportion of defaulted loans in the data set. Comparing the null model with other models will help us gauge the impact of predictor variables.
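The "coin toss" baseline can be sketched in a few lines of Python. This is only an illustration under the assumption that the null model predicts a default with probability equal to the share of defaulted loans in the data set (1000 of 3000), independently of any predictor; the function names are hypothetical.

```python
import random


def coin_toss_predictions(n, default_rate, seed=0):
    """Predict 1 (default) with probability `default_rate`, ignoring all predictors."""
    rng = random.Random(seed)
    return [int(rng.random() < default_rate) for _ in range(n)]


def expected_coin_toss_accuracy(default_rate):
    """Chance that an independent random prediction matches the true label."""
    return default_rate ** 2 + (1 - default_rate) ** 2


rate = 1000 / 3000  # share of defaulted loans in the data set
predictions = coin_toss_predictions(3000, rate)
print(round(expected_coin_toss_accuracy(rate), 3))  # 0.556
```

Under these assumptions the baseline is right a little over half the time, which gives a concrete floor that any model using predictor variables should beat.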
After evaluating different models that used different predictor variables, I noticed that of all the independent variables, interest predicted default rates most accurately. Thus, interest rate was used to predict default rates for all the models included in the results. Other characteristics of loans or borrowers of loans that proved to be useful for predicting default were annual income and loan amount.
Results
The first two models in the table (Figure 16), Models 1 and 2, were simplistic models to start off with. The next four, Models 3 through 6, were more complex models that performed slightly better according to the evaluation metrics. The last model is a random “coin toss” model that predicted around 1 defaulted loan for every 2 loans not defaulted.
The borrower’s length of employment, reason for borrowing, credit ratio, and credit balance were categorized differently in some of the models. In Model 3, for example, the employment predictor variable remained unchanged, so that there were 11 different categories ranging from less than a year of employment to at least 10. In Model 5, these categories were grouped into 3 categories — less than 3 years, 3 to 9 years, and at least 10 years — while in Model 6, they were grouped into 4 categories — less than a year, 1 to 4 years, 5 to 8 years, and at least 9 years.
Also in Model 6, the “reasons” independent variable narrowed the many different reasons for taking out a loan down to business, renovation, cc, debt, and all others. The “high_bal” variable was binary, true for any borrower with a credit balance above $15,000, and the “high_ratio” variable was binary and true for any borrower with a credit ratio above 60%.
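The regrouping described above can be sketched as a handful of small Python functions. This is a hedged illustration, not the project's actual code: it assumes employment length is stored as a whole number of years from 0 (less than a year) to 10 (10 or more), and the group labels and function names are hypothetical. The cut-offs and the kept reasons come from the text.

```python
def employment_group_model5(years):
    """Model 5 grouping: less than 3 years, 3 to 9 years, at least 10 years."""
    if years < 3:
        return "<3"
    if years <= 9:
        return "3-9"
    return "10+"


def employment_group_model6(years):
    """Model 6 grouping: less than a year, 1 to 4, 5 to 8, at least 9 years."""
    if years < 1:
        return "<1"
    if years <= 4:
        return "1-4"
    if years <= 8:
        return "5-8"
    return "9+"


KEPT_REASONS = {"business", "renovation", "cc", "debt"}


def reason_group(reason):
    """Keep the four named reasons and fold everything else into 'other'."""
    return reason if reason in KEPT_REASONS else "other"


def high_bal(credit_balance):
    return credit_balance > 15000   # credit balance above $15,000


def high_ratio(credit_ratio):
    return credit_ratio > 60        # credit ratio above 60%


print(employment_group_model5(9), employment_group_model6(9))  # 3-9 9+
print(reason_group("boat"), high_bal(20000), high_ratio(45))   # other True False
```

Note that a borrower with 9 years of employment lands in the middle group under Model 5 but in the top group under Model 6, which is one reason the two models can give different coefficients for employment.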
While we do see the null model performing better in terms of weighted accuracy, our models, Models 1 through 6, have higher AUCs and are around 20% better in terms of actual (predictive) accuracy, so ultimately, our efforts in modeling paid off. Figure 17 is a graph of the ROC curves for Model 1, Model 4, Model 6, and the null model. The closer a curve is to the top left corner, the greater its AUC, and thus, the better the model performs. Even the simplest models like Model 1 seem to perform drastically better than the null model.
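One way to build intuition for AUC is that it equals the probability that a randomly chosen defaulted loan receives a higher predicted score than a randomly chosen non-defaulted loan, with ties counted as half. The sketch below computes AUC directly from that definition; the labels and scores are made up for illustration and are not model output from this project.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 0]              # 1 = defaulted (toy example)
scores = [0.9, 0.3, 0.4, 0.5, 0.2]    # hypothetical predicted probabilities
print(round(auc(labels, scores), 3))          # 0.833
print(auc(labels, [0.5] * len(labels)))       # 0.5
```

A model that gives every loan the same score gets an AUC of 0.5, which is the level a predictor-free baseline is expected to sit near.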
Although Model 5 tended to be the most accurate of all the models listed, I would say that the best model is Model 4, which uses amount, income, interest, term, and an interaction between interest and amount to predict default rates. It is nearly as accurate as Model 5 and even performed better in terms of weighted accuracy, but it is simpler. Additionally, all the variables had significant effects (p < 0.05) on the default rate, so this model is both explanatory and predictive to a good degree.
The code block below shows the coefficients of predictor variables in Model 4.
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9175  -0.8867  -0.6372   1.1358   2.2330
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.398e+00  3.510e-01  -3.982 6.84e-05 ***
amount          -4.156e-05  1.937e-05  -2.146  0.03191 *
interest         1.327e-01  2.108e-02   6.293 3.12e-10 ***
term            -2.696e-01  5.533e-02  -4.873 1.10e-06 ***
income          -3.994e-06  1.218e-06  -3.278  0.00105 **
amount:interest  2.914e-06  1.196e-06   2.436  0.01486 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 3615.9  on 2839  degrees of freedom
Residual deviance: 3297.6  on 2834  degrees of freedom
AIC: 3309.6
Number of Fisher Scoring iterations: 4
At the intercept, when amount and income are (hypothetically) $0, interest is 0%, and the term is 3 years long, the log odds are −1.398. This means the odds of defaulting are 0.247. When the term of the loan is 5 years instead of 3, the log odds decrease by 0.270, so the odds of defaulting decrease by 23.6%. It seems that a borrower is more likely to default on a shorter loan than on a longer one. When income is $10,000 higher, the odds of defaulting decrease by 3.9%.
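The conversions in the paragraph above all follow the same recipe: exponentiate a change in log odds to get the multiplicative change in odds. Here is a minimal Python sketch that reproduces those figures from the Model 4 coefficients; the dictionary and function names are hypothetical, and the term coefficient is treated as the 5-year versus 3-year difference, as in the text.

```python
import math

# Coefficients copied from the Model 4 summary above.
COEF = {
    "intercept": -1.398,
    "term": -0.2696,
    "income": -3.994e-06,
}


def odds_change_percent(delta_log_odds):
    """Percent change in the odds for a given change in the log odds."""
    return (math.exp(delta_log_odds) - 1) * 100


print(round(math.exp(COEF["intercept"]), 3))                  # 0.247
print(round(odds_change_percent(COEF["term"]), 1))            # -23.6
print(round(odds_change_percent(COEF["income"] * 10000), 1))  # -3.9
```

Odds are not probabilities: baseline odds of 0.247 correspond to a probability of 0.247 / 1.247, or roughly 0.2.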
Because the model includes an interaction between amount and interest, the effect of the loan amount depends on the interest rate. With interest held at 0%, a $1,000 increase in the loan amount is expected to decrease the log odds by 0.0416, so the odds of defaulting decrease by 4.07%. Each additional percentage point of interest then raises the effect of amount on the log odds by 2.914 ⋅ 10⁻⁶ per dollar, or about 0.0029 per $1,000. For example, at an interest rate of 1%, a $1,000 increase in amount would decrease the log odds by 0.0416 − 0.0029 = 0.0387, so the odds of defaulting would decrease by 3.80%. In other words, as loan interest rises, the lowering effect of higher loan amounts on default rates diminishes.
Conversely, the effect of interest depends on the loan amount: for every additional dollar of loan amount, a 1 percentage point increase in interest raises the log odds of default by an extra 2.914 ⋅ 10⁻⁶. With the loan amount held at $0, a 1 percentage point increase in the interest rate is expected to increase the log odds by 0.133, so the odds of defaulting increase by 14.2%. For a loan of $1, the increase would be 0.133 + 2.914 ⋅ 10⁻⁶ ≈ 0.133, which is practically unchanged, but for a $10,000 loan it would be 0.133 + 0.029 = 0.162, so the interaction becomes noticeable for larger loans.
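Since the interaction term makes the effect of each variable depend on the other, it can help to compute the effects at a few specific values. The sketch below does this with the Model 4 coefficients; the function names and the chosen evaluation points (interest of 0% and 1%, a $10,000 loan) are assumptions for illustration. The break-even interest rate at the end is simply the value at which the amount effect in the fitted equation changes sign, not a result reported in the original analysis.

```python
import math

# Coefficients copied from the Model 4 summary.
B_AMOUNT = -4.156e-05
B_INTEREST = 0.1327
B_INTERACTION = 2.914e-06


def amount_effect(interest, delta_amount=1000):
    """Change in log odds when amount rises by delta_amount at a fixed interest rate (%)."""
    return (B_AMOUNT + B_INTERACTION * interest) * delta_amount


def interest_effect(amount, delta_interest=1):
    """Change in log odds when interest rises by delta_interest points at a fixed amount ($)."""
    return (B_INTEREST + B_INTERACTION * amount) * delta_interest


def odds_change_percent(delta_log_odds):
    return (math.exp(delta_log_odds) - 1) * 100


print(round(odds_change_percent(amount_effect(0)), 1))        # -4.1
print(round(odds_change_percent(amount_effect(1)), 1))        # -3.8
print(round(odds_change_percent(interest_effect(0)), 1))      # 14.2
print(round(odds_change_percent(interest_effect(10000)), 1))  # 17.6

# Interest rate at which a larger amount stops lowering the log odds in this model.
print(round(-B_AMOUNT / B_INTERACTION, 1))                    # 14.3
```

Read this way, the fitted equation says that larger loans are associated with lower odds of default only at interest rates below roughly 14.3%; whether that holds outside the range of the data is a separate question.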
Discussion
In this project, I was only able to examine basic predictor variables such as interest and amount. I did not find many more patterns between other variables, but I would be interested in studying the other variables more in-depth. Anonymized variables specifically were mostly skipped over, so future steps could include researching those. I would also be interested in re-analyzing the variables that I did use by splitting them into different categories. Perhaps this would help identify special patterns in the data that were not clear in previous models. Exploring how demographics and cultural background tie in to loan defaulting would also be an interesting extension of this project. Perhaps some cultures emphasize the importance of honor more than others would, thus discouraging using loans and especially defaulting on loans.
Another option for future research would be examining correlations between independent variables more. Although there were a few graphs of correlations between independent variables in the Exploratory Data Analysis, I largely focused on correlations between default rates and predictor variables in this project. Learning more about how independent variables affect each other may give insight on how they affect default rates.
Looking at the coefficients for Model 4, I was surprised that a larger loan amount is generally associated with lower odds of default. This may be because borrowers who take out larger loans are more cautious or plan more carefully. For example, amortization schedules for mortgages would tend to be better planned than repayment for a loan taken out on impulse. However, if interest is raised as well, then the lowering effect of higher loan amounts on the probability of default diminishes. Thus, Model 4 implies that the ideal loan with minimal probability of default would have a large amount and a low interest rate.
It is also interesting that a longer term is associated with lower odds of defaulting; perhaps this is because borrowers have more time to pull themselves out of debt. Aside from those two predictor variables, the effects of interest, income, and the interaction between amount and interest are all as expected, because wealthier borrowers are more likely to be able to pay back a loan, and high-interest loans are less likely to be paid back.
Generally, I found Model 4, which used the predictor variables of amount, interest, term, income, and an interaction between amount and interest, to perform the best because it balanced simplicity and performance. It was one of the most accurate models I was able to build, but it was not overly complicated, and every predictor variable still had a significant effect on the probability of default. In terms of future research, combining some predictor variables from Model 4 with other predictor variables that were left relatively unexplored could yield a better model.
If possible, using different types of models would also allow for different interpretations of the same variables. Logistic regression assumes that each predictor variable has a linear, and therefore one-directional, relationship with the log odds of default. Interest worked particularly well with the logistic regression models because its relationship with default rates was so close to linear. However, variables such as credit balance or loan amount often have more complicated relationships with default than that. I would be interested in exploring other types of models that could reflect the more complex nature of these predictor variables.
Conclusion
Through Exploratory Data Analysis, we discovered correlations in and between predictor variables that guided us in building our model. We were able to conclude that the probability of a loan default may be predicted by loan interest rates, loan amount, and borrower income, among other factors. We also supported the credibility of our models with evaluation metrics that measured accuracy and error. The predictor variable that best suited logistic regression was interest because of its roughly linear relationship with default. To further improve on this research, different predictor variables or types of models may be examined.
Appendix
Thanks for reading!