Credit Risk Modeling (details + scripts)
Data:
I use the German credit data set. The data is provided in two formats. The first consists of 20 variables (7 numerical, 13 categorical), listed below:
– Status of existing checking account
– Duration in months (numerical)
– Credit history
– Purpose
– Credit amount (numerical)
– Savings account/bonds
– Present employment since
– Installment rate in percentage of disposable income (numerical)
– Personal status and sex
– Other debtors / guarantors
– Present residence since (numerical)
– Property
– Age in years (numerical)
– Other installment plans
– Housing
– Number of existing credits at this bank (numerical)
– Job
– Number of people being liable to provide maintenance for (numerical)
– Telephone
– Foreign worker
and the second contains 24 numerical variables, for algorithms that work strictly with numerical inputs. In both formats the target variable, default, shows whether the loan applicant repaid the loan (0/Nondef) or defaulted (1/Def).
After loading the data sets as data frames, I check the distributions of the variables and look for missing data. Fortunately, neither data set shows any abnormal behavior or missing values. The summaries of both formats are as follows:
R> d.g <- read.table('./data/german.data')
R> colnames(d.g) <- c('account_balance', 'months_loan_duration', 'credit_history',
+    'purpose', 'credit_amount', 'savings_balance', 'employment_status',
+    'installment_rate', 'personal_status', 'other_debtors_guarantors',
+    'present_residence_years', 'property', 'age', 'other_installment',
+    'housing', 'number_credits_this_bank', 'job', 'number_dependents',
+    'phone', 'foreign_worker', 'default')
R> d.g$default <- factor(ifelse(d.g$default==1, 0, 1))
R> levels(d.g$default) <- c('Nondef', 'Def')
R> summary(d.g)
 account_balance months_loan_duration credit_history    purpose
 A11:274         Min.   : 4.0         A30: 40        A43    :280
 A12:269         1st Qu.:12.0         A31: 49        A40    :234
 A13: 63         Median :18.0         A32:530        A42    :181
 A14:394         Mean   :20.9         A33: 88        A41    :103
                 3rd Qu.:24.0         A34:293        A49    : 97
                 Max.   :72.0                        A46    : 50
                                                     (Other): 55
 credit_amount   savings_balance employment_status installment_rate
 Min.   :  250   A61:603         A71: 62           Min.   :1.000
 1st Qu.: 1366   A62:103         A72:172           1st Qu.:2.000
 Median : 2320   A63: 63         A73:339           Median :3.000
 Mean   : 3271   A64: 48         A74:174           Mean   :2.973
 3rd Qu.: 3972   A65:183         A75:253           3rd Qu.:4.000
 Max.   :18424                                     Max.   :4.000
 personal_status other_debtors_guarantors present_residence_years property
 A91: 50         A101:907                 Min.   :1.000           A121:282
 A92:310         A102: 41                 1st Qu.:2.000           A122:232
 A93:548         A103: 52                 Median :3.000           A123:332
 A94: 92                                  Mean   :2.845           A124:154
                                          3rd Qu.:4.000
                                          Max.   :4.000
      age        other_installment housing    number_credits_this_bank
 Min.   :19.00   A141:139          A151:179   Min.   :1.000
 1st Qu.:27.00   A142: 47          A152:713   1st Qu.:1.000
 Median :33.00   A143:814          A153:108   Median :1.000
 Mean   :35.55                                Mean   :1.407
 3rd Qu.:42.00                                3rd Qu.:2.000
 Max.   :75.00                                Max.   :4.000
   job      number_dependents  phone      foreign_worker   default
 A171: 22   Min.   :1.000     A191:596    A201:963       Nondef:700
 A172:200   1st Qu.:1.000     A192:404    A202: 37       Def   :300
 A173:630   Median :1.000
 A174:148   Mean   :1.155
            3rd Qu.:1.000
            Max.   :2.000
and the numerical format:
R> d.g.num <- read.table('./data/german.data-numeric')
R> d.g.num$default <- factor(ifelse(d.g.num$V25==1, 0, 1))
R> d.g.num$V25 <- NULL
R> summary(d.g.num)
       V1              V2             V3              V4
 Min.   :1.000   Min.   : 4.0   Min.   :0.000   Min.   :  2.00
 1st Qu.:1.000   1st Qu.:12.0   1st Qu.:2.000   1st Qu.: 14.00
 Median :2.000   Median :18.0   Median :2.000   Median : 23.00
 Mean   :2.577   Mean   :20.9   Mean   :2.545   Mean   : 32.71
 3rd Qu.:4.000   3rd Qu.:24.0   3rd Qu.:4.000   3rd Qu.: 40.00
 Max.   :4.000   Max.   :72.0   Max.   :4.000   Max.   :184.00
       V5              V6              V7              V8
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.:1.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000
 Median :1.000   Median :3.000   Median :3.000   Median :3.000
 Mean   :2.105   Mean   :3.384   Mean   :2.682   Mean   :2.845
 3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.:3.000   3rd Qu.:4.000
 Max.   :5.000   Max.   :5.000   Max.   :4.000   Max.   :4.000
       V9             V10             V11             V12
 Min.   :1.000   Min.   :19.00   Min.   :1.000   Min.   :1.000
 1st Qu.:1.000   1st Qu.:27.00   1st Qu.:3.000   1st Qu.:1.000
 Median :2.000   Median :33.00   Median :3.000   Median :1.000
 Mean   :2.358   Mean   :35.55   Mean   :2.675   Mean   :1.407
 3rd Qu.:3.000   3rd Qu.:42.00   3rd Qu.:3.000   3rd Qu.:2.000
 Max.   :4.000   Max.   :75.00   Max.   :3.000   Max.   :4.000
      V13             V14             V15             V16
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :0.000
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.000
 Median :1.000   Median :1.000   Median :1.000   Median :0.000
 Mean   :1.155   Mean   :1.404   Mean   :1.037   Mean   :0.234
 3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.000
 Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :1.000
      V17             V18             V19             V20
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.000
 Median :0.000   Median :1.000   Median :0.000   Median :0.000
 Mean   :0.103   Mean   :0.907   Mean   :0.041   Mean   :0.179
 3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.000
 Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000
      V21             V22             V23           V24       default
 Min.   :0.000   Min.   :0.000   Min.   :0.0   Min.   :0.00   0:700
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0   1st Qu.:0.00   1:300
 Median :1.000   Median :0.000   Median :0.0   Median :1.00
 Mean   :0.713   Mean   :0.022   Mean   :0.2   Mean   :0.63
 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.0   3rd Qu.:1.00
 Max.   :1.000   Max.   :1.000   Max.   :1.0   Max.   :1.00
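Beyond summary(), an explicit missing-value check is straightforward; this quick sketch (not part of the original scripts) counts NAs per column and should return all zeros for both data sets:

# Count missing values per column; all zeros confirms there are no NAs
colSums(is.na(d.g))
colSums(is.na(d.g.num))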
Methodology
I will apply logistic regression, decision trees and random forests to these data sets and evaluate their efficiency based on AUC (area under the ROC curve), the kappa statistic and the bad rate. Finally, I will propose the model that shows the best efficiency on the training data set.
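Since the kappa statistic is the least self-explanatory of these metrics, here is a quick illustration with made-up numbers (my own aside, not part of the project scripts): kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the accuracy expected by chance.

# Toy confusion matrix (rows = actual, cols = predicted); numbers are illustrative only
cm <- matrix(c(60, 10,    # actual 0: TN, FP
               15, 15),   # actual 1: FN, TP
             nrow = 2, byrow = TRUE)
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed accuracy = 0.75
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement  = 0.60
(po - pe) / (1 - pe)                         # kappa = 0.375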
1- Logistic Regression
First I divide the data into training and test sets. For this purpose I employ the createDataPartition function from the caret package to obtain stratified random samples of the data. I will use the training set to train the algorithms and find the most efficient model, and keep the test set aside to check the real performance of the selected model on unseen data.
R> prop.table(table(d.g.num$default))
  0   1 
0.7 0.3 
R> library(caret)
R> in_train <- createDataPartition(d.g.num$default, p = 0.66, list = FALSE)
R> train_set <- d.g.num[in_train,]
R> test_set <- d.g.num[-in_train,]
R> prop.table(table(train_set$default))
  0   1 
0.7 0.3 
R> prop.table(table(test_set$default))
  0   1 
0.7 0.3 
I start with the most general model, which includes all variables, then construct the next model by omitting the predictor with the worst p-value. Since the p-values of the predictors depend strongly on the training data, I calculate average p-values through cross-validation within the training set. The pvalue.coef function returns the p-values of the coefficients of a given model.
R> pvalue.coef <- function(mdl)
+ {
+   dum <- summary(mdl)$coefficients
+   dum <- dum[,'Pr(>|z|)']
+   return(dum)
+ }
The eval_model_nonimpvar function, with the help of the cv_nonimpvar function, calculates and plots boxplots of the coefficient p-values over a repeated (100 times) 10-fold cross-validation for the given model formula fml.
R> cv_nonimpvar <- function(fml, folds)
+ {
+   rs <- lapply(folds, function(x)
+   {
+     train_set_train <- train_set[-x, ]
+     mdl <- glm(formula = fml, family = binomial(link=logit), data = train_set_train)
+     nonimpvar <- pvalue.coef(mdl)
+     return(nonimpvar)
+   })
+   vars_name <- all.vars(fml[[3]])
+   rs_matrix <- matrix(unlist(rs), nrow=length(vars_name)+1)
+   colnames(rs_matrix) <- c(paste0('Fold0',1:9),'Fold10')
+   rownames(rs_matrix) <- c('Intercept', vars_name)
+   return(rs_matrix)
+ }
R> n.cv <- 100
R> eval_model_nonimpvar <- function(fml)
+ {
+   t.mat <- NULL
+   for (i in 1:n.cv)
+   {
+     folds.new <- createFolds(train_set$default, k = 10)
+     cv_var_op <- cv_nonimpvar(fml, folds.new)
+     t.mat <- cbind(t.mat, cv_var_op)
+   }
+   dum <- apply(t.mat, 1, mean)
+   id <- order(dum, decreasing = T)
+   boxplot(t(t.mat[id,]), las=2, main=fml, ylim=c(0,.2))
+   points(dum[id], pch=20, cex=1, col='red')
+   return(t.mat)
+ }
Applying this to the most general model:
fml0 <- default ~ V1+V2+V3+V4+V5+V6+V7+V8+V9+V10+V11+V12+V13+V14+V15+V16+V17+V18+V19+V20+V21+V22+V23+V24
nonimpvar0.mat <- eval_model_nonimpvar(fml0)
I obtain the results below:
These results suggest removing V12 to form a better model (fml1). After removing it, I repeat the same procedure to find the next worst predictor, and continue until all remaining predictors show an average p-value below 0.05. This yields 16 models (below), each simpler than fml0.
#--------------------------------------------------------------------------------
fml1 <- default ~ V1+V2+V3+V4+V5+V6+V7+V8+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V20+V21+V22+V23+V24   # removed: 12
nonimpvar1.mat <- eval_model_nonimpvar(fml1)
#--------------------------------------------------------------------------------
fml2 <- default ~ V1+V2+V3+V4+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V20+V21+V22+V23+V24   # removed: 12, 8
nonimpvar2.mat <- eval_model_nonimpvar(fml2)
#--------------------------------------------------------------------------------
fml3 <- default ~ V1+V2+V3+V4+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V21+V22+V23+V24   # removed: 12, 8, 20
nonimpvar3.mat <- eval_model_nonimpvar(fml3)
#--------------------------------------------------------------------------------
fml4 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V21+V22+V23+V24   # removed: 12, 8, 20, 4
nonimpvar4.mat <- eval_model_nonimpvar(fml4)
#--------------------------------------------------------------------------------
fml5 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V21+V22+V24   # removed: 12, 8, 20, 4, 23
nonimpvar5.mat <- eval_model_nonimpvar(fml5)
#--------------------------------------------------------------------------------
fml6 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V21+V22   # removed: 12, 8, 20, 4, 23, 24
nonimpvar6.mat <- eval_model_nonimpvar(fml6)
#--------------------------------------------------------------------------------
fml7 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V14+V15+V16+V17+V18+V19+V21   # removed: 12, 8, 20, 4, 23, 24, 22
nonimpvar7.mat <- eval_model_nonimpvar(fml7)
#--------------------------------------------------------------------------------
fml8 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V15+V16+V17+V18+V19+V21   # removed: 12, 8, 20, 4, 23, 24, 22, 14
nonimpvar8.mat <- eval_model_nonimpvar(fml8)
#--------------------------------------------------------------------------------
fml9 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V13+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21
nonimpvar9.mat <- eval_model_nonimpvar(fml9)
#--------------------------------------------------------------------------------
fml10 <- default ~ V1+V2+V3+V5+V6+V7+V9+V10+V11+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13
nonimpvar10.mat <- eval_model_nonimpvar(fml10)
#--------------------------------------------------------------------------------
fml11 <- default ~ V1+V2+V3+V5+V7+V9+V10+V11+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6
nonimpvar11.mat <- eval_model_nonimpvar(fml11)
#--------------------------------------------------------------------------------
fml12 <- default ~ V1+V2+V3+V5+V9+V10+V11+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6, 7
nonimpvar12.mat <- eval_model_nonimpvar(fml12)
#--------------------------------------------------------------------------------
fml13 <- default ~ V1+V2+V3+V5+V10+V11+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6, 7, 9
nonimpvar13.mat <- eval_model_nonimpvar(fml13)
#--------------------------------------------------------------------------------
fml14 <- default ~ V1+V2+V3+V5+V11+V15+V16+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6, 7, 9, 10
nonimpvar14.mat <- eval_model_nonimpvar(fml14)
#--------------------------------------------------------------------------------
fml15 <- default ~ V1+V2+V3+V5+V11+V15+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6, 7, 9, 10, 16
nonimpvar15.mat <- eval_model_nonimpvar(fml15)
#--------------------------------------------------------------------------------
fml16 <- default ~ V1+V2+V3+V5+V11+V17+V18+V19   # removed: 12, 8, 20, 4, 23, 24, 22, 14, 21, 13, 6, 7, 9, 10, 16, 15
nonimpvar16.mat <- eval_model_nonimpvar(fml16)
#--------------------------------------------------------------------------------
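For reference, the elimination above was carried out by hand from the cross-validated boxplots; a minimal sketch of an automated version, using p-values from a single fit instead of the CV averages, could look like this (V1 to V24 and train_set as above):

# Sketch: automated backward elimination on single-fit p-values
# (the procedure above instead averages p-values over repeated CV)
vars <- paste0('V', 1:24)
repeat {
  fml <- reformulate(vars, response = 'default')
  mdl <- glm(fml, family = binomial(link=logit), data = train_set)
  pv  <- summary(mdl)$coefficients[, 'Pr(>|z|)']
  pv  <- pv[names(pv) != '(Intercept)']
  if (max(pv) < 0.05) break                    # stop when all predictors are significant
  vars <- setdiff(vars, names(which.max(pv)))  # drop the worst predictor
}
fml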
The following figure shows the p-values of the coefficients of the last model (fml16).
Having 17 models, I evaluate them based on AUC (area under the ROC curve). I use the pROC and caret packages to calculate the average AUC for each model using repeated (100 times) 10-fold cross-validation.
library(pROC)
cv_auc_function <- function(fml, folds)
{
  auc_r <- lapply(folds, function(x)
  {
    train_set_train <- train_set[-x, ]
    train_set_test <- train_set[x, ]
    mdl <- glm(formula = fml, family = binomial(link=logit), data = train_set_train)
    pred <- predict(mdl, newdata = train_set_test, type = 'response')
    dum.roc <- roc(train_set_test$default, pred)
    as.vector(auc(dum.roc))
  })
  return(unlist(auc_r))
  # Output is a vector like:
  #    Fold01    Fold02    Fold03    Fold04    Fold05    Fold06    Fold07    Fold08    Fold09    Fold10
  # 0.7456522 0.8832952 0.8478261 0.8836957 0.7652174 0.6904255 0.7425532 0.7356979 0.7804348 0.7739130
}
#.............................................
fml_vec <- paste0('fml', 0:16)
auc_matrix_cv <- NULL
for (i in fml_vec)
{
  fml <- get(i)
  t.vec <- NULL
  for (j in 1:n.cv)
  {
    folds.new <- createFolds(train_set$default, k = 10)
    dum <- cv_auc_function(fml, folds.new)
    t.vec <- c(t.vec, dum)
  }
  auc_matrix_cv <- cbind(auc_matrix_cv, t.vec)
}
colnames(auc_matrix_cv) <- fml_vec
auc_means <- apply(auc_matrix_cv, 2, mean)
id <- order(auc_means, decreasing = T)
plot(auc_means[id], xaxt='n', xlab='', ylim=c(0.782, 0.8), pch=20, cex=1.5, col='red',
     ylab='AUC', main='Evaluating all 17 models based on AUC via CV')   # CV2.png
axis(1, at=1:17, labels=names(auc_means[id]), las=2)
As we see in the following plot, fml10 shows the best average AUC, followed by the fml13, fml12 and fml11 models.
For a bank, apart from the demand for high accuracy, sensitivity and specificity of a model, a low bad rate is also desired. The bad rate is defined as the ratio of false negatives to all predicted negatives, i.e., the fraction of accepted applicants (those predicted as non-default) who actually default. A low bad rate means fewer actual defaulters slip through as accepted loans, which reduces the bank's risk of loss. Although a low bad rate is desirable, it must come along with an adequate acceptance ratio of applicants. For this project I assume that a maximum 15% bad rate and a minimum 60% acceptance ratio are reasonable for a particular bank, and therefore I try to find the best model that satisfies these criteria.
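To make the definition concrete, here is a toy calculation (illustrative numbers only; positive class = default):

# Toy counts at some cutoff: TN and FN are the accepted (predicted non-default) applicants
TN <- 180; FP <- 30   # actual non-defaulters
FN <- 25;  TP <- 65   # actual defaulters
FN / (FN + TN)                    # bad rate: ~0.122, i.e. ~12% of accepted loans go bad
(FN + TN) / (TN + FP + FN + TP)   # acceptance ratio: ~0.683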
The following br.se.ac function calculates the bad rate, sensitivity, specificity, accuracy and kappa statistic over the range of 0% to 100% acceptance ratio for a particular model.
#------br.se.ac function--------------------------------------------------
br.se.ac <- function(actl_class, pred_prob)
{
  acp_rate <- seq(0,1,by=0.01)
  cut_off <- NULL
  bad_rate <- NULL
  sens <- NULL
  spec <- NULL
  accu <- NULL
  kppa <- NULL
  for (i in 1:length(acp_rate))
  {
    cutoff <- quantile(pred_prob, acp_rate[i])
    pred_class <- ifelse(pred_prob > cutoff, 1, 0)
    pred_class <- factor(pred_class, levels=levels(actl_class))
    cont_table <- table(actl_class, pred_class)
    TN <- cont_table[1,1]
    FP <- cont_table[1,2]
    FN <- cont_table[2,1]
    TP <- cont_table[2,2]
    br <- FN/(FN+TN)
    se <- TP/(TP+FN)
    sp <- TN/(TN+FP)
    ac <- (TN+TP)/(TN+FP+FN+TP)
    py <- ((TP+FN)/(TN+FP+FN+TP)) * ((TP+FP)/(TN+FP+FN+TP))
    pn <- ((FP+TN)/(TN+FP+FN+TP)) * ((FN+TN)/(TN+FP+FN+TP))
    pe <- py+pn
    kp <- (ac-pe)/(1-pe)
    cut_off <- c(cut_off, cutoff)
    bad_rate <- c(bad_rate, br)
    sens <- c(sens, se)
    spec <- c(spec, sp)
    accu <- c(accu, ac)
    kppa <- c(kppa, kp)
  }
  data.frame(acp_rate, cut_off, bad_rate, sens, spec, accu, kppa)
}
The function and code below calculate and plot the average bad rate versus acceptance ratio for all proposed models through repeated (100 times) 10-fold cross-validation.
cv_br_function <- function(fml, folds)
{
  br_r <- lapply(folds, function(x)
  {
    train_set_train <- train_set[-x, ]
    train_set_test <- train_set[x, ]
    mdl <- glm(formula = fml, family = binomial(link=logit), data = train_set_train)
    pred <- predict(mdl, newdata = train_set_test, type = 'response')
    dum <- br.se.ac(train_set_test$default, pred)
    return(dum$bad_rate)
  })
  br.matrix <- matrix(unlist(br_r), ncol=10)
  colnames(br.matrix) <- c(paste0('Fold0',1:9),'Fold10')
  rownames(br.matrix) <- paste0(0:100,'%')
  return(br.matrix)
}
#....Calculating average bad rate for each model via CV...........
for (i in fml_vec)
{
  print(i)
  t.mat <- NULL
  for (j in 1:n.cv)
  {
    folds.new <- createFolds(train_set$default, k = 10)
    dum <- cv_br_function(get(i), folds.new)
    t.mat <- cbind(t.mat, dum)
  }
  dum2 <- apply(t.mat, 1, mean)
  dum3 <- paste0(i,'_br.cv')
  assign(dum3, dum2)
}
#..................plotting...................
plot(seq(0,1,by=0.01), rep(0, 101), type='n', ylim=c(0,0.3),
     xlab='Acceptance Rate', ylab='Bad Rate', main='Average bad rate based on CV')
for (i in fml_vec)
{
  dum3 <- paste0(i,'_br.cv')   # e.g. "fml10_br.cv"
  lines(seq(0,1,by=0.01), get(dum3), type='o', col='grey', pch=20)
}
# Highlight the leading models on top of the grey curves
col_vec <- c('blue', 'black', 'darkgreen', 'red', 'orange')
cnt <- 0
for (i in c('fml10', 'fml13', 'fml12', 'fml11', 'fml0'))
{
  cnt <- cnt+1
  dum3 <- paste0(i,'_br.cv')
  lines(seq(0,1,by=0.01), get(dum3), type='o', col=col_vec[cnt], pch=20)   # CV3.png
}
abline(h=0.1, lty=2)
abline(h=0.15, lty=2)
# Manual legend
text(0.05, 0.30, 'fml10', cex=1.3); points(0.02, 0.30, pch=20, cex=1.3, col='blue')
text(0.05, 0.29, 'fml13', cex=1.3); points(0.02, 0.29, pch=20, cex=1.3, col='black')
text(0.05, 0.28, 'fml12', cex=1.3); points(0.02, 0.28, pch=20, cex=1.3, col='darkgreen')
text(0.05, 0.27, 'fml11', cex=1.3); points(0.02, 0.27, pch=20, cex=1.3, col='red')
text(0.05, 0.26, 'fml0',  cex=1.3); points(0.02, 0.26, pch=20, cex=1.3, col='orange')
From the above plot we can see that fml10 and fml11 show the best bad-rate performance in the 10% to 15% range. Since fml10 has already shown the better AUC, I select fml10 as the best logistic model. It is worth noting that this model was selected based on the training data set; its actual performance will be checked on the unseen test set in the last section of this report.
2- Decision Tree
As for logistic regression, I first split the data into training and test sets.
R> in_train <- createDataPartition(d.g$default, p = 0.66, list = FALSE)
R> train_set <- d.g[in_train,]
R> test_set <- d.g[-in_train,]
R> prop.table(table(train_set$default))
Nondef    Def 
   0.7    0.3 
R> prop.table(table(test_set$default))
Nondef    Def 
   0.7    0.3 
I use the rpart and rpart.plot packages to apply the decision tree algorithm and plot the trees. First I construct trees with different complexity parameter (CP) values and choose candidate trees based on the minimum average cross-validated error.
R> library(rpart)
R> library(rpart.plot)
R> xerr.matrix <- NULL
R> for (i in 1:1000)
+ {
+   fml1 <- rpart(default ~ ., method = "class", data = train_set,
+                 control = rpart.control(cp = 0.0001))
+   dum <- printcp(fml1)   # CP table; column 4 is the cross-validated error (xerror)
+   xerr.matrix <- cbind(xerr.matrix, dum[,4])
+ }
R> rownames(xerr.matrix) <- round(dum[,1], digits=6)
R> boxplot(t(xerr.matrix))
R> mean.xerr <- apply(xerr.matrix, 1, mean)
R> points(1:7, mean.xerr, pch=20, col='red', cex=2)   # the CP table has 7 rows here
The results are plotted below:
As we see, the four trees with CP values 0.01010101, 0.01515152, 0.01767677 and 0.02525253 have the lowest errors. To get a view of these trees, I prune the full tree at these CP values and plot the results:
R> id <- which.min(mean.xerr)
R> cp.best.xrr <- dum[id,1]              # [1] 0.01010101
R> cp.table.sample <- dum
R> fml2 <- prune(fml1, cp = cp.best.xrr) # 0.01010101
R> fml3 <- prune(fml1, cp = dum[4,1])    # 0.01515152
R> fml4 <- prune(fml1, cp = dum[3,1])    # 0.01767677
R> fml5 <- prune(fml1, cp = dum[2,1])    # 0.02525253
The prp function is used to plot the trees:
R> for (i in paste0('fml', 1:5))
+ {
+   png(file=paste0(i,'.png'), width = 1500, height = 800)
+   prp(get(i), main=i)
+   dev.off()
+ }
The figures below show fml1 and fml5 for comparison:
and
I also select trees based on the average AUC value, using the train function from the caret package. To speed up the computation I construct a parallel backend with the doMC package:
R> library(doMC)
R> registerDoMC(cores = 9)
R> tc <- trainControl(method = "repeatedcv", number = 10, repeats = 100,
+                     classProbs = TRUE, summaryFunction = twoClassSummary)
R> rpart.grid <- expand.grid(.cp = c(1e-7, 0.0001, 0.0004, 0.0005,
+                                    seq(0.001, 0.03, by=0.001), seq(0.04, 0.06, by=0.005)))
R> train.rpart.rcv <- train(default ~ ., data=train_set, method="rpart",
+                           trControl=tc, tuneGrid=rpart.grid, metric='ROC')
R> registerDoSEQ()
The below figure shows the result:
As we can see, the trees with CP values between 0.002 and 0.007 show the best AUC efficiency. In total I have 10 candidate trees with different CP values (the six just found plus the four from the previous step).
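The tuning results can also be read directly off the train object; for instance (a small sketch using standard caret accessors):

# CP value with the highest mean ROC, and the top of the tuning grid sorted by ROC
train.rpart.rcv$bestTune
head(train.rpart.rcv$results[order(-train.rpart.rcv$results$ROC),
                             c('cp', 'ROC', 'Sens', 'Spec')])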
Following a procedure similar to the evaluation of the logistic models, I evaluate the 10 candidate trees based on their bad rate and other statistics with the functions and code below:
#------br.se.ac function--------------------------------------------------
R> br.se.ac <- function(actl_class, pred_prob)
+ {
+   acp_rate <- seq(0,1,by=0.01)
+   cut_off <- NULL
+   bad_rate <- NULL
+   sens <- NULL
+   spec <- NULL
+   accu <- NULL
+   kppa <- NULL
+   for (i in 1:length(acp_rate))
+   {
+     cutoff <- quantile(pred_prob, acp_rate[i])
+     pred_class <- ifelse(pred_prob > cutoff, 'Def', 'Nondef')
+     pred_class <- factor(pred_class, levels=levels(actl_class))
+     cont_table <- table(actl_class, pred_class)
+     TN <- cont_table[1,1]
+     FP <- cont_table[1,2]
+     FN <- cont_table[2,1]
+     TP <- cont_table[2,2]
+     br <- FN/(FN+TN)
+     se <- TP/(TP+FN)
+     sp <- TN/(TN+FP)
+     ac <- (TN+TP)/(TN+FP+FN+TP)
+     py <- ((TP+FN)/(TN+FP+FN+TP)) * ((TP+FP)/(TN+FP+FN+TP))
+     pn <- ((FP+TN)/(TN+FP+FN+TP)) * ((FN+TN)/(TN+FP+FN+TP))
+     pe <- py+pn
+     kp <- (ac-pe)/(1-pe)
+     cut_off <- c(cut_off, cutoff)
+     bad_rate <- c(bad_rate, br)
+     sens <- c(sens, se)
+     spec <- c(spec, sp)
+     accu <- c(accu, ac)
+     kppa <- c(kppa, kp)
+   }
+   data.frame(acp_rate, cut_off, bad_rate, sens, spec, accu, kppa)
+ }
#----cv_br_function------------------------------------------------------------
R> cv_br_function <- function(cp_par, folds)
+ {
+   br_r <- lapply(folds, function(x)
+   {
+     train_set_train <- train_set[-x, ]
+     train_set_test <- train_set[x, ]
+     mdl <- rpart(default ~ ., method = "class", data = train_set_train,
+                  control = rpart.control(cp = cp_par))
+     pred <- predict(mdl, newdata = train_set_test, type='prob')
+     dum <- br.se.ac(train_set_test$default, pred[,2])
+     return(dum$bad_rate)
+   })
+   br.matrix <- matrix(unlist(br_r), ncol=10)
+   colnames(br.matrix) <- c(paste0('Fold0',1:9),'Fold10')
+   rownames(br.matrix) <- paste0(0:100,'%')
+   return(br.matrix)
+ }
#....Calculating average bad rate for each tree via repeated CV...........
R> n.cv <- 100
R> cp_vec <- c(seq(0.002, 0.007, by=0.001), 0.01010101, 0.01515152, 0.01767677, 0.02525253)
R> for (cp_p in cp_vec)
+ {
+   print(cp_p)
+   t.mat <- NULL
+   for (j in 1:n.cv)
+   {
+     folds.new <- createFolds(train_set$default, k = 10)
+     dum <- cv_br_function(cp_p, folds.new)
+     t.mat <- cbind(t.mat, dum)
+   }
+   dum2 <- apply(t.mat, 1, mean)
+   dum3 <- paste0('fml_cp_',cp_p,'_br.cv')   # e.g. "fml_cp_0.002_br.cv"
+   assign(dum3, dum2)
+ }
#..................plotting...................
plot(seq(0,1,by=0.01), rep(0, 101), type='n', ylim=c(0,0.3),
     xlab='Acceptance Rate', ylab='Bad Rate', main='Average bad rate based on CV')
for (cp_p in cp_vec)
{
  dum3 <- paste0('fml_cp_',cp_p,'_br.cv')
  lines(seq(0,1,by=0.01), get(dum3), type='o', col='grey', pch=20)
}
We can see that all 10 candidate trees suffer from a high bad rate. In the next section I train the random forest algorithm on the data.
3- Random Forest
Using the caret package, I train random forests on the training data set, optimizing AUC over different mtry values (the number of features randomly selected at each split) with repeated cross-validation. I again use a parallel backend to speed up the computation.
library(caret)
library(doMC)
registerDoMC(cores = 12)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 100,
                   classProbs = TRUE, summaryFunction = twoClassSummary)
rf_grid <- expand.grid(.mtry = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 20))
m_rf <- train(default ~ ., data = train_set, method = "rf", metric = "ROC",
              trControl = tc, tuneGrid = rf_grid)
registerDoSEQ()
I plot the results:
plot(m_rf$results$mtry, m_rf$results$ROC, type='b', xaxt='n', xlab='mtry', ylab='AUC')
axis(1, at=c(2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 20), las=2, col.axis='black', col='black')
We can see that the random forests with mtry values between 3 and 10 show high AUC values. Using the br.se.ac function from the previous section, I calculate the bad rate of all candidate random forest models with the code below:
#----Evaluating bad rate vs acceptance rate of all *RANDOM FOREST* models built via CV----
R> cv_br_function_rf <- function(mtry_par, folds)
+ {
+   br_r <- lapply(folds, function(x)
+   {
+     train_set_train <- train_set[-x, ]
+     train_set_test <- train_set[x, ]
+     mdl <- randomForest(default ~ ., data = train_set_train, ntree=500, mtry=mtry_par)
+     pred <- predict(mdl, newdata=train_set_test, type = 'prob')
+     dum <- br.se.ac(train_set_test$default, pred[,2])
+     return(dum$bad_rate)
+   })
+   br.matrix <- matrix(unlist(br_r), ncol=10)
+   colnames(br.matrix) <- c(paste0('Fold0',1:9),'Fold10')
+   rownames(br.matrix) <- paste0(0:100,'%')
+   return(br.matrix)
+ }
#....Calculating average bad rate for each model via CV...........
R> library(randomForest)
R> n.cv <- 100
R> mtry_vec <- 3:10
R> for (mtry_p in mtry_vec)
+ {
+   t.mat <- NULL
+   for (j in 1:n.cv)
+   {
+     folds.new <- createFolds(train_set$default, k = 10)
+     dum <- cv_br_function_rf(mtry_p, folds.new)
+     t.mat <- cbind(t.mat, dum)
+   }
+   dum2 <- apply(t.mat, 1, mean)
+   dum3 <- paste0('fml_mtry_',mtry_p,'_br.cv')   # e.g. "fml_mtry_3_br.cv"
+   assign(dum3, dum2)
+ }
The results are plotted:
R> plot(seq(0,1,by=0.01), rep(0, 101), type='n', ylim=c(0,0.3),
+      xlab='Acceptance Rate', ylab='Bad Rate',
+      main='Average bad rate of 10 trees (gold color) and 8 random forest models (grey)')
R> for (cp_p in cp_vec)
+ {
+   dum3 <- paste0('fml_cp_',cp_p,'_br.cv')       # e.g. "fml_cp_0.002_br.cv"
+   lines(seq(0,1,by=0.01), get(dum3), type='o', col='lightgoldenrod', pch=20)
+ }
R> for (mtry_p in mtry_vec)
+ {
+   dum3 <- paste0('fml_mtry_',mtry_p,'_br.cv')   # e.g. "fml_mtry_10_br.cv"
+   lines(seq(0,1,by=0.01), get(dum3), type='o', col='grey', pch=20)
+ }
R> lines(seq(0,1,by=0.01), fml_mtry_3_br.cv, type='o', col='red', pch=20)
R> abline(h=0.1, lty=2)
R> abline(h=0.15, lty=2)
In the figure above I have also plotted the tree results (gold) for comparison with the random forest models (grey). The random forests clearly outperform the single trees, and the forest with mtry=3 (highlighted in red) gives the best performance, i.e., a low bad rate at a high acceptance rate.
Results of the final models
The fml10 logistic model and the random forest with mtry=3 are the winners on the training data set. To see their real performance, I test both models on the unseen test set. The logistic fml10 model is evaluated with the br.se.ac function defined in the logistic regression section (fml10 belongs to the numerical data set, so the corresponding train/test split and the 0/1 version of br.se.ac are used):
# train_set / test_set here are the numeric-data partitions from the logistic regression section
mdl <- glm(formula = fml10, family = binomial(link=logit), data = train_set)
pred <- predict(mdl, newdata = test_set, type = 'response')
log_fml10_results <- br.se.ac(test_set$default, pred)
Since the random forest is inherently random, I average the performance of the model with mtry=3 over 100 runs:
R> sum_mat <- matrix(0, nrow=101, ncol=7)
R> for (i in 1:100)
+ {
+   mdl <- randomForest(default ~ ., data = train_set, ntree=500, mtry=3)
+   pred <- predict(mdl, newdata=test_set, type = 'prob')
+   dum <- br.se.ac(test_set$default, pred[,2])
+   sum_mat <- sum_mat + dum
+ }
R> rf_mtry3_results <- sum_mat/100
Assuming a maximum 15% bad rate, we have:
                      acp_rate  cut_off     bad_rate     sens        spec         accu       kppa
Logistic Regression   0.65      0.44823798  0.14932127   0.67647059  0.789915966  0.7558824  0.445187166
Random Forest         0.63      0.4042546   0.151472427  0.68137255  0.764957983  0.7398824  0.418835806
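The rows above can be picked out programmatically, e.g. taking the highest acceptance rate whose bad rate stays at or below 15% (a sketch against the result data frames computed above):

# Highest acceptance rate satisfying the 15% bad rate constraint
pick_row <- function(res, max_br = 0.15) {
  ok <- res[!is.na(res$bad_rate) & res$bad_rate <= max_br, ]
  ok[which.max(ok$acp_rate), ]
}
pick_row(log_fml10_results)
pick_row(rf_mtry3_results)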
The figure below illustrates the bad rate for both models:
We see that logistic regression and the random forest show similar efficiency; however, logistic regression performs slightly better at the 15% bad rate. A 65% acceptance ratio at a 15% bad rate, with 76% accuracy and a kappa value of about 0.45, is a good achievement for this model.