10.1 Regression Using Tree and Random Forest Methods with the US Crime Dataset

Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too).

A) Regression Tree Model

library(tree)          # regression trees (part a)
library(randomForest)  # random forests (part b)

uscrime_data <- read.table('C:/Users/mjpearl/Desktop/omsa/ISYE-6501-OAN/hw7/data/uscrime.txt', header = TRUE, stringsAsFactors = FALSE)
head(uscrime_data)
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6
##       Prob    Time Crime
## 1 0.084602 26.2011   791
## 2 0.029599 25.2999  1635
## 3 0.083401 24.3006   578
## 4 0.015801 29.9012  1969
## 5 0.041399 21.2998  1234
## 6 0.034201 20.9995   682
tree_model <- tree(Crime ~., uscrime_data)
summary(tree_model)
## 
## Regression tree:
## tree(formula = Crime ~ ., data = uscrime_data)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF"  "NW" 
## Number of terminal nodes:  7 
## Residual mean deviance:  47390 = 1896000 / 40 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -573.900  -98.300   -1.545    0.000  110.600  490.100

The tree output shows that the model splits on the Po1, Pop, LF, and NW variables. With the default settings, these are the factors that give the best splits of the data.
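
As a small sketch, the same list of split variables can be pulled out of the fitted tree object programmatically (the tree package stores the splits in tree_model$frame, with leaves labelled "<leaf>"):

# Variables the fitted tree actually splits on (leaves are marked "<leaf>")
split_vars <- setdiff(as.character(unique(tree_model$frame$var)), "<leaf>")
split_vars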

plot(tree_model)
text(tree_model)

Now we can use the predict function on this tree to compare the predicted values against the observed ones, and compute R^2 to gauge its relative performance.

y_pred_tree <- predict(tree_model)
plot(y_pred_tree)
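
The plot above shows only the fitted values on their own. A sketch of a more direct visual check is to plot predictions against the observed crime rates and look for points close to the 45-degree line:

# Predicted vs. observed crime rates; points near the dashed line indicate a good fit
plot(uscrime_data$Crime, y_pred_tree,
     xlab = "Observed crime rate", ylab = "Predicted crime rate")
abline(0, 1, lty = 2)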

#Helper function (adapted from a reference) for R^2: 1 minus the residual sum of squares divided by the total sum of squares
rsq <- function (y_pred, y_actual) {
  residual_ss <- sum((y_pred - y_actual) ^ 2)
  total_ss <- sum((y_actual - mean(y_actual)) ^ 2) 
  rsq <- 1 - residual_ss / total_ss
  return (rsq)
}


rsq_tree <- rsq(y_pred_tree, uscrime_data$Crime)
rsq_tree
## [1] 0.7244962

With an R^2 of roughly 0.72, the unpruned tree provides a reasonably good estimator of the crime variable on the training data.

We can now look at pruning the tree, working backward from the leaves and removing a branch whenever doing so does not increase the estimation error. We cap the number of terminal nodes (the "best" argument) at 4, since a tree with too great a depth tends to produce leaves containing fewer than 5% of the total observations and, ultimately, overfitting.

#Prune the tree down to 4 terminal nodes and plot the result
tree_pruned <- prune.tree(tree_model,best= 4)
plot(tree_pruned)
text(tree_pruned)

The pruned tree has fewer leaves and splits on fewer features than the original tree.

y_pred_pruned <- predict(tree_pruned)
rsq_pruned <- rsq(y_pred_pruned, uscrime_data$Crime)
rsq_pruned
## [1] 0.6174017

The lower R^2 of the pruned tree means it accounts for less of the variance in the training data than the original tree; with more features to split on, the unpruned tree will almost always fit the training set better. However, the pruned tree should generalize better to a test set, since the smaller tree has lower variance and is less prone to overfitting.
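
One way to check whether the smaller tree really generalizes better, rather than relying on in-sample R^2 alone, is cross-validation. A minimal sketch using cv.tree from the tree package (the seed chosen here is arbitrary):

# Cross-validated deviance for each candidate tree size
set.seed(1)  # arbitrary seed, only to make the CV folds reproducible
cv_result <- cv.tree(tree_model)
plot(cv_result$size, cv_result$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "Cross-validated deviance")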

Another exercise would be to prune the tree even further, down to a single split, since one well-chosen split already partitions the data reasonably well and accounts for a significant amount of variance with minimal feature engineering (in comparison to our previous homework exercises). From there we can recompute R^2 to measure the impact on performance, as sketched below.
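
A minimal sketch of that exercise, pruning all the way down to a single split (two terminal nodes) and recomputing R^2 with the helper above:

# Prune to a single split and measure how much variance it alone explains
tree_two_leaves <- prune.tree(tree_model, best = 2)
rsq_two_leaves <- rsq(predict(tree_two_leaves), uscrime_data$Crime)
rsq_two_leaves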

B) Random Forest Model

Now we will run the random forest model on our dataset. From the lectures, a reasonable number of predictors to consider at each split is about 1 + log(n), where n is the number of predictors, and the number of trees is ideally somewhere between 500 and 1000. Each tree is grown on a bootstrap sample of the data and considers a different random subset of features at each split (bagging), so every tree sees a different combination of observations and features. For regression, the forest averages the predictions across all trees, whereas a classification forest takes the majority vote (mode).

#Rule of thumb from the lectures: roughly 1 + log(n) predictors tried at each split
num_features_rf <- 1 + log(ncol(uscrime_data))
randomForest_model_n600 <- randomForest(Crime~., data = uscrime_data, mtry = num_features_rf, importance = T, ntree = 600)
randomForest_model_n900 <- randomForest(Crime~., data = uscrime_data, mtry = num_features_rf, importance = T, ntree = 900)
randomForest_model_n600
## 
## Call:
##  randomForest(formula = Crime ~ ., data = uscrime_data, mtry = num_features_rf,      importance = T, ntree = 600) 
##                Type of random forest: regression
##                      Number of trees: 600
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 83432.16
##                     % Var explained: 43.01
randomForest_model_n900
## 
## Call:
##  randomForest(formula = Crime ~ ., data = uscrime_data, mtry = num_features_rf,      importance = T, ntree = 900) 
##                Type of random forest: regression
##                      Number of trees: 900
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 81794.19
##                     % Var explained: 44.13
plot(randomForest_model_n600)

plot(randomForest_model_n900)

From the error plots we can see that the error flattens out at around 100 trees, and tweaking ntree mostly changes how quickly the curve levels off. This makes sense: beyond a point, adding more trees helps the model generalize to only marginally more variation (the 900-tree forest explains about one percentage point more variance than the 600-tree forest).
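
To put a number on that flattening, a quick sketch reads the out-of-bag MSE directly off the fitted forest (a regression randomForest object stores one OOB MSE value per tree in its mse component):

# OOB mean squared error after 100 trees versus the full 600-tree forest
randomForest_model_n600$mse[c(100, 600)]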

# Importance Plot
varImpPlot(randomForest_model_n600)

The feature importance plot lines up well with the original tree model: most of the variables the tree split on (Po1, Pop, LF, NW) appear near the top of the plot. It is also consistent with the pruned result, where the Pop feature was dropped; here Pop shows up with only moderate importance.

However, the random forest's % Var explained is only about 43-44%, well below the 72% R^2 of the tree-based model used previously. The comparison is not entirely fair, though: the forest's figure is based on out-of-bag predictions and so estimates out-of-sample performance, while the tree's R^2 was computed on the same data used to fit it.
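
For a like-for-like number, a sketch of an out-of-bag R^2 for the forest using the same rsq helper as before (calling predict on a randomForest object without newdata returns its out-of-bag predictions, so this should roughly match the % Var explained figure):

# Out-of-bag R^2 for the 900-tree forest
rsq_rf <- rsq(predict(randomForest_model_n900), uscrime_data$Crime)
rsq_rf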

10.2 Real-Life Application

An example where I used logistic regression at work was predicting whether a customer is fraudulent based on various features; in this case we are looking for a binary response of 0 or 1. Many factors can contribute to mortgage fraud, including the parties involved (e.g. brokers, lawyers), FICO score, credit history, demographic information, etc. Such a model can be used in production to flag which customers are more likely to be fraudulent and to help minimize risk for the client.
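
As a purely illustrative sketch of what that modeling step looks like, using simulated data and made-up column names (fico_score, broker_flag, num_accounts, is_fraud) rather than anything from the production system:

# Simulated stand-in for a fraud dataset -- every column here is hypothetical
set.seed(123)
n <- 500
fraud_data <- data.frame(
  fico_score   = rnorm(n, mean = 680, sd = 50),
  broker_flag  = rbinom(n, 1, 0.3),
  num_accounts = rpois(n, 3)
)
# Simulate a binary fraud label loosely related to the features
p_fraud <- plogis(-4 + 0.01 * (700 - fraud_data$fico_score) + 1.2 * fraud_data$broker_flag)
fraud_data$is_fraud <- rbinom(n, 1, p_fraud)

fraud_model <- glm(is_fraud ~ fico_score + broker_flag + num_accounts,
                   family = binomial(link = "logit"), data = fraud_data)
summary(fraud_model)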

10.3 Logistic Regression with German Credit Dataset

Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=”logit”) in your glm function call.

german_data <- read.table('C:/Users/mjpearl/Desktop/omsa/ISYE-6501-OAN/hw7/data/germancredit.txt',header = FALSE)
#Recode the response: in the raw data 1 = good credit and 2 = bad credit, so map 2 to 0 to get a binary 0/1 target
german_data$V21[german_data$V21 == 2] <- 0
head(german_data)
##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173
## 6 A14 36 A32 A46 9055 A65 A73  2 A93 A101   4 A124  35 A143 A153   1 A172
##   V18  V19  V20 V21
## 1   1 A192 A201   1
## 2   1 A191 A201   0
## 3   2 A191 A201   1
## 4   2 A191 A201   1
## 5   2 A191 A201   0
## 6   2 A192 A201   1
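
Before splitting the data, it is worth a quick check of the class balance after recoding (the Statlog documentation describes 700 good and 300 bad applicants); a one-line sketch:

# Count bad (0) vs. good (1) credit risks
table(german_data$V21)
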
#75/25 train/test split
sample <- sample(1:nrow(german_data), size = round(0.75*nrow(german_data)))
train <- german_data[sample,]
test <- german_data[-sample,]
#Fit the full logistic regression model using all predictors
logistic_full <- glm(V21 ~ ., family = binomial(link = "logit"), data = train)
summary(logistic_full)
## 
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5105  -0.6782   0.3660   0.6896   2.3010  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.926e-02  1.307e+00   0.038  0.96994    
## V1A12        4.070e-01  2.584e-01   1.575  0.11528    
## V1A13        1.100e+00  4.229e-01   2.600  0.00933 ** 
## V1A14        1.864e+00  2.770e-01   6.729 1.70e-11 ***
## V2          -3.455e-02  1.122e-02  -3.079  0.00208 ** 
## V3A31       -1.445e-01  6.393e-01  -0.226  0.82118    
## V3A32        2.744e-01  4.910e-01   0.559  0.57617    
## V3A33        6.148e-01  5.460e-01   1.126  0.26013    
## V3A34        1.065e+00  5.014e-01   2.124  0.03368 *  
## V4A41        1.846e+00  4.406e-01   4.190 2.79e-05 ***
## V4A410       1.671e+00  8.954e-01   1.867  0.06195 .  
## V4A42        7.388e-01  3.116e-01   2.371  0.01774 *  
## V4A43        7.766e-01  2.868e-01   2.708  0.00677 ** 
## V4A44        3.803e-01  8.306e-01   0.458  0.64707    
## V4A45        7.616e-01  7.960e-01   0.957  0.33869    
## V4A46        2.349e-01  4.672e-01   0.503  0.61513    
## V4A48        1.526e+01  5.385e+02   0.028  0.97739    
## V4A49        4.903e-01  3.849e-01   1.274  0.20275    
## V5          -9.748e-05  5.203e-05  -1.873  0.06101 .  
## V6A62        3.750e-01  3.290e-01   1.140  0.25443    
## V6A63        7.080e-01  4.857e-01   1.458  0.14490    
## V6A64        1.589e+00  7.021e-01   2.264  0.02360 *  
## V6A65        8.194e-01  3.037e-01   2.698  0.00697 ** 
## V7A72        5.579e-01  5.372e-01   1.039  0.29901    
## V7A73        4.869e-01  5.188e-01   0.938  0.34800    
## V7A74        1.504e+00  5.617e-01   2.677  0.00743 ** 
## V7A75        8.441e-01  5.206e-01   1.622  0.10490    
## V8          -2.628e-01  1.046e-01  -2.512  0.01200 *  
## V9A92        2.514e-01  4.423e-01   0.568  0.56973    
## V9A93        5.673e-01  4.307e-01   1.317  0.18772    
## V9A94        4.086e-01  5.318e-01   0.768  0.44233    
## V10A102      2.628e-01  5.226e-01   0.503  0.61503    
## V10A103      1.017e+00  4.725e-01   2.154  0.03128 *  
## V11         -2.425e-02  1.007e-01  -0.241  0.80971    
## V12A122     -3.659e-01  2.990e-01  -1.224  0.22110    
## V12A123     -3.231e-01  2.817e-01  -1.147  0.25140    
## V12A124     -1.179e+00  5.121e-01  -2.303  0.02126 *  
## V13          1.084e-02  1.080e-02   1.003  0.31593    
## V14A142      4.629e-02  4.791e-01   0.097  0.92303    
## V14A143      7.315e-01  2.795e-01   2.617  0.00888 ** 
## V15A152      5.622e-01  2.673e-01   2.103  0.03546 *  
## V15A153      9.865e-01  5.635e-01   1.751  0.08002 .  
## V16         -2.238e-01  2.227e-01  -1.005  0.31493    
## V17A172     -1.390e+00  8.742e-01  -1.589  0.11197    
## V17A173     -1.172e+00  8.513e-01  -1.377  0.16854    
## V17A174     -1.208e+00  8.535e-01  -1.416  0.15686    
## V18         -1.860e-01  2.905e-01  -0.640  0.52200    
## V19A192      1.472e-01  2.353e-01   0.626  0.53155    
## V20A202      1.264e+00  6.483e-01   1.950  0.05115 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 917.98  on 749  degrees of freedom
## Residual deviance: 660.12  on 701  degrees of freedom
## AIC: 758.12
## 
## Number of Fisher Scoring iterations: 14

Based on the significance levels in the full model, we refit the logistic regression using only the more informative predictors:

logistics_impFeatures <- glm(V21 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V9 + V10 + V13,
                 family=binomial(link = 'logit'),
                 data=train)
summary(logistics_impFeatures)
## 
## Call:
## glm(formula = V21 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V9 + V10 + 
##     V13, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6721  -0.7919   0.4048   0.7125   2.0913  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.697e-01  8.870e-01  -1.093 0.274264    
## V1A12        5.067e-01  2.418e-01   2.095 0.036148 *  
## V1A13        1.187e+00  4.107e-01   2.891 0.003837 ** 
## V1A14        1.865e+00  2.630e-01   7.090 1.34e-12 ***
## V2          -4.570e-02  9.940e-03  -4.598 4.27e-06 ***
## V3A31       -4.339e-01  5.999e-01  -0.723 0.469494    
## V3A32        4.548e-01  4.577e-01   0.994 0.320360    
## V3A33        5.243e-01  5.287e-01   0.992 0.321413    
## V3A34        1.117e+00  4.846e-01   2.305 0.021149 *  
## V4A41        1.554e+00  4.127e-01   3.766 0.000166 ***
## V4A410       1.316e+00  8.975e-01   1.466 0.142629    
## V4A42        5.971e-01  2.926e-01   2.041 0.041271 *  
## V4A43        6.427e-01  2.718e-01   2.365 0.018049 *  
## V4A44        3.561e-01  8.191e-01   0.435 0.663708    
## V4A45        6.263e-01  7.253e-01   0.864 0.387832    
## V4A46       -1.268e-01  4.412e-01  -0.287 0.773817    
## V4A48        1.487e+01  5.507e+02   0.027 0.978463    
## V4A49        4.014e-01  3.654e-01   1.099 0.271943    
## V5          -4.504e-05  4.247e-05  -1.060 0.288974    
## V6A62        1.611e-01  3.048e-01   0.528 0.597229    
## V6A63        8.250e-01  4.758e-01   1.734 0.082910 .  
## V6A64        1.488e+00  6.633e-01   2.243 0.024894 *  
## V6A65        8.499e-01  2.931e-01   2.900 0.003734 ** 
## V7A72        2.147e-01  4.614e-01   0.465 0.641642    
## V7A73        2.096e-01  4.391e-01   0.477 0.633184    
## V7A74        1.172e+00  4.845e-01   2.418 0.015593 *  
## V7A75        3.573e-01  4.455e-01   0.802 0.422532    
## V9A92        4.047e-02  4.219e-01   0.096 0.923572    
## V9A93        2.645e-01  4.057e-01   0.652 0.514338    
## V9A94        3.389e-01  5.081e-01   0.667 0.504787    
## V10A102      2.127e-01  4.931e-01   0.431 0.666217    
## V10A103      1.171e+00  4.545e-01   2.578 0.009950 ** 
## V13          6.768e-03  9.706e-03   0.697 0.485644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 917.98  on 749  degrees of freedom
## Residual deviance: 695.52  on 717  degrees of freedom
## AIC: 761.52
## 
## Number of Fisher Scoring iterations: 14
#Predicted probabilities for the held-out test set
y_pred_feature_selection <- predict(logistics_impFeatures, test, type = "response")
#Sweep classification thresholds from 0.08 to 0.95 and record the weighted
#misclassification cost at each one
results <- vector("list", 95)

for (i in seq(8, 95, by = 1)){
    thresh <- i/100
    y_pred <- as.integer(y_pred_feature_selection > thresh)
    conf_matrix <- as.matrix(table(y_pred, test$V21))
    #cost = (predicted 1, actual 0) + 5 * (predicted 0, actual 1)
    cost <- conf_matrix[2,1] + 5*conf_matrix[1,2]
    results[i] <- cost
}

print(results)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## [1] 78
## 
## [[9]]
## [1] 77
## 
## [[10]]
## [1] 77
## 
## [[11]]
## [1] 77
## 
## [[12]]
## [1] 76
## 
## [[13]]
## [1] 76
## 
## [[14]]
## [1] 76
## 
## [[15]]
## [1] 75
## 
## [[16]]
## [1] 75
## 
## [[17]]
## [1] 75
## 
## [[18]]
## [1] 75
## 
## [[19]]
## [1] 75
## 
## [[20]]
## [1] 74
## 
## [[21]]
## [1] 73
## 
## [[22]]
## [1] 76
## 
## [[23]]
## [1] 75
## 
## [[24]]
## [1] 74
## 
## [[25]]
## [1] 79
## 
## [[26]]
## [1] 84
## 
## [[27]]
## [1] 81
## 
## [[28]]
## [1] 80
## 
## [[29]]
## [1] 88
## 
## [[30]]
## [1] 92
## 
## [[31]]
## [1] 95
## 
## [[32]]
## [1] 105
## 
## [[33]]
## [1] 105
## 
## [[34]]
## [1] 105
## 
## [[35]]
## [1] 110
## 
## [[36]]
## [1] 112
## 
## [[37]]
## [1] 116
## 
## [[38]]
## [1] 130
## 
## [[39]]
## [1] 135
## 
## [[40]]
## [1] 132
## 
## [[41]]
## [1] 135
## 
## [[42]]
## [1] 135
## 
## [[43]]
## [1] 135
## 
## [[44]]
## [1] 139
## 
## [[45]]
## [1] 138
## 
## [[46]]
## [1] 137
## 
## [[47]]
## [1] 136
## 
## [[48]]
## [1] 146
## 
## [[49]]
## [1] 150
## 
## [[50]]
## [1] 154
## 
## [[51]]
## [1] 159
## 
## [[52]]
## [1] 157
## 
## [[53]]
## [1] 175
## 
## [[54]]
## [1] 178
## 
## [[55]]
## [1] 178
## 
## [[56]]
## [1] 192
## 
## [[57]]
## [1] 196
## 
## [[58]]
## [1] 200
## 
## [[59]]
## [1] 205
## 
## [[60]]
## [1] 205
## 
## [[61]]
## [1] 209
## 
## [[62]]
## [1] 218
## 
## [[63]]
## [1] 222
## 
## [[64]]
## [1] 231
## 
## [[65]]
## [1] 233
## 
## [[66]]
## [1] 237
## 
## [[67]]
## [1] 246
## 
## [[68]]
## [1] 261
## 
## [[69]]
## [1] 264
## 
## [[70]]
## [1] 274
## 
## [[71]]
## [1] 279
## 
## [[72]]
## [1] 299
## 
## [[73]]
## [1] 304
## 
## [[74]]
## [1] 328
## 
## [[75]]
## [1] 338
## 
## [[76]]
## [1] 343
## 
## [[77]]
## [1] 347
## 
## [[78]]
## [1] 357
## 
## [[79]]
## [1] 370
## 
## [[80]]
## [1] 375
## 
## [[81]]
## [1] 380
## 
## [[82]]
## [1] 389
## 
## [[83]]
## [1] 407
## 
## [[84]]
## [1] 430
## 
## [[85]]
## [1] 440
## 
## [[86]]
## [1] 458
## 
## [[87]]
## [1] 493
## 
## [[88]]
## [1] 508
## 
## [[89]]
## [1] 536
## 
## [[90]]
## [1] 575
## 
## [[91]]
## [1] 600
## 
## [[92]]
## [1] 634
## 
## [[93]]
## [1] 653
## 
## [[94]]
## [1] 673
## 
## [[95]]
## [1] 723

We only sweep thresholds between 0.08 and 0.95 because, at thresholds very close to zero or one, every observation falls into a single predicted class and the confusion matrix collapses to a single row, which breaks the cost calculation. Within this range we can search for the best threshold.
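
To read the cost-minimizing threshold off programmatically rather than scanning the list by eye, a small sketch (unlist drops the NULL entries for thresholds below 0.08, so the offset is added back):

# Threshold with the lowest weighted cost
costs <- unlist(results)
best_threshold <- (which.min(costs) + 7) / 100
best_threshold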

From the preceding results, the weighted cost bottoms out at thresholds in the low 0.20s and climbs quickly afterwards; we take 0.25 as our cutoff.

#Final predictions on the test set using the 0.25 threshold
y_pred_feature_selection_final <- predict(logistics_impFeatures, test, type = "response")
y_pred_final <- as.integer(y_pred_feature_selection_final > 0.25)
table(y_pred_final, test$V21)
##             
## y_pred_final   0   1
##            0  10   3
##            1  64 173
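
For completeness, a short sketch of how the quality of fit at this threshold can be summarized from the confusion matrix above; from the printed table this works out to an accuracy of 183/250, or about 0.73, and a weighted cost of 64 + 5*3 = 79:

# Accuracy and weighted cost at the 0.25 threshold
conf_final <- table(y_pred_final, test$V21)
accuracy   <- sum(diag(conf_final)) / sum(conf_final)
cost_final <- conf_final[2,1] + 5*conf_final[1,2]
c(accuracy = accuracy, cost = cost_final)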