
I'm trying to work with some data here and compare the test-set performance of `glm` and `lda`.

The data is attached here.

This is my general plan to try to do both of these:

training = read.csv("train.csv")
testing = read.csv("test.csv")

model_glm <- glm(V1 ~.,family=binomial(link='logit'),data=training)
pred_glm <- predict(model_glm, testing)

library(MASS)
model_lda <- lda(V1 ~ ., data=training)
predict_lda <- predict(model_lda, testing)

# Calculating classification error
err_lda <- predict_lda - testing$V1
err2_lda <- err_lda[err_lda != 0]
classification_error_lda = length(err2_lda)/length(testing$V1)

However, these do not work. I thought there was a multinomial family for `glm`, but that doesn't seem to exist. Also, since my first column is the digit label and the remaining columns are all grayscale values, I thought `V1 ~ .` would be right, but I don't think that is correct for these cases either. Does anyone have any idea if my syntax/setup is wrong?

edit: I added how I'm trying to calculate the classification error for LDA. However, I don't think my approach works, as it gives:

Error in predict_lda - testing$V1 : non-numeric argument to binary operator
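(For context: `predict()` on an `lda` fit returns a list with components `$class`, `$posterior`, and `$x`, not a numeric vector, which is what triggers this error. A minimal sketch of the fix on the built-in `iris` data, since the digit CSVs aren't included here:)

```r
library(MASS)

# predict() on an lda fit returns a list, not a vector --
# subtracting it from a numeric column is what raises the error.
fit  <- lda(Species ~ ., data = iris)   # small built-in stand-in dataset
pred <- predict(fit, iris)
str(pred, max.level = 1)                # $class, $posterior, $x

# Compare the factor of predicted labels, not the list itself:
classification_error <- mean(pred$class != iris$Species)
```

The key point is to use `pred$class` (a factor of predicted labels) when comparing against the true labels.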

  • `pred_glm <- (model_glm, testing)` doesn't do anything. You are looking for `pred_glm <- predict(model_glm, testing)`. – Phil Feb 16 '17 at 06:40
  • Oh right sorry that was a typo on my part - I do have predict(model_glm, testing). However that gives me the same answer as lm... and I should be doing multiclass logistic regression here. I'm not totally sure how to implement that in this case though. – J. P. Feb 16 '17 at 06:46

1 Answer


This is not a binary classification problem; it's a multi-class (digit) classification problem with 10 class labels. So instead of binary logistic regression, you need a multinomial logit model. Try the following; as we can see, the overall prediction accuracy of the multinomial logit model is higher than that of LDA.

library(nnet)
# multinom() fits a multinomial log-linear model via a neural net;
# MaxNWts raises nnet's default cap on the number of weights
model_mlogit <- multinom(V1 ~ ., data = training, MaxNWts = 2581)
predict_mlogit <- predict(model_mlogit, testing)
library(MASS)
model_lda <- lda(V1 ~ ., data = training)
predict_lda <- predict(model_lda, testing)  # a list: $class, $posterior, $x
library(caret)
confusionMatrix(predict_mlogit, testing$V1)
# output 
Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9
         0 343   0   5   2   5   4   1   0   7   0
         1   0 254   1   0   2   1   0   0   0   0
         2   3   2 163   4   5   0   4   2   7   0
         3   2   1   6 145   1   7   0   3   3   1
         4   3   1   8   1 168   3   4   5   1   3
         5   2   0   1   8   2 137   4   0   9   1
         6   2   1   1   1   4   3 156   0   0   0
         7   3   1   5   2   1   0   0 132   4   2
         8   1   1   7   3   4   2   1   0 130   5
         9   0   3   1   0   8   3   0   5   5 165

Overall Statistics

               Accuracy : 0.8934         
                 95% CI : (0.879, 0.9065)
    No Information Rate : 0.1789         
    P-Value [Acc > NIR] : < 2.2e-16      

                  Kappa : 0.8803         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity            0.9554   0.9621  0.82323  0.87349  0.84000  0.85625  0.91765  0.89796  0.78313  0.93220
Specificity            0.9854   0.9977  0.98507  0.98696  0.98395  0.98538  0.99347  0.99032  0.98696  0.98634
Pos Pred Value         0.9346   0.9845  0.85789  0.85799  0.85279  0.83537  0.92857  0.88000  0.84416  0.86842
Neg Pred Value         0.9902   0.9943  0.98074  0.98857  0.98232  0.98752  0.99239  0.99192  0.98057  0.99340
Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972  0.08470  0.07324  0.08271  0.08819
Detection Rate         0.1709   0.1266  0.08122  0.07225  0.08371  0.06826  0.07773  0.06577  0.06477  0.08221
Detection Prevalence   0.1829   0.1286  0.09467  0.08421  0.09816  0.08171  0.08371  0.07474  0.07673  0.09467
Balanced Accuracy      0.9704   0.9799  0.90415  0.93023  0.91198  0.92082  0.95556  0.94414  0.88505  0.95927

confusionMatrix(predict_lda$class,testing$V1)
#output
Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9
         0 342   0   7   3   1   6   1   0   5   0
         1   0 251   2   0   4   0   0   1   0   0
         2   0   0 157   3   6   0   3   0   2   0
         3   4   2   4 142   0  16   0   2  11   0
         4   3   5  12   3 174   3   3   7   7   4
         5   1   0   2   9   0 125   3   0   4   0
         6   5   3   1   0   2   0 157   0   0   0
         7   0   0   1   1   2   0   0 129   0   5
         8   3   1  12   4   1   5   3   1 135   3
         9   1   2   0   1  10   5   0   7   2 165

Overall Statistics

               Accuracy : 0.8854         
                 95% CI : (0.8706, 0.899)
    No Information Rate : 0.1789         
    P-Value [Acc > NIR] : < 2.2e-16      

                  Kappa : 0.8713         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity            0.9526   0.9508  0.79293  0.85542  0.87000  0.78125  0.92353  0.87755  0.81325  0.93220
Specificity            0.9860   0.9960  0.99226  0.97882  0.97399  0.98971  0.99401  0.99516  0.98207  0.98470
Pos Pred Value         0.9370   0.9729  0.91813  0.78453  0.78733  0.86806  0.93452  0.93478  0.80357  0.85492
Neg Pred Value         0.9896   0.9926  0.97767  0.98686  0.98544  0.98121  0.99293  0.99037  0.98314  0.99338
Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972  0.08470  0.07324  0.08271  0.08819
Detection Rate         0.1704   0.1251  0.07823  0.07075  0.08670  0.06228  0.07823  0.06428  0.06726  0.08221
Detection Prevalence   0.1819   0.1286  0.08520  0.09018  0.11011  0.07175  0.08371  0.06876  0.08371  0.09616
Balanced Accuracy      0.9693   0.9734  0.89260  0.91712  0.92200  0.88548  0.95877  0.93636  0.89766  0.95845

[EDIT] Without caret:

table(predict_mlogit,testing$V1)
# output
predict_mlogit   0   1   2   3   4   5   6   7   8   9
             0 343   0   5   2   5   4   1   0   7   0
             1   0 254   1   0   2   1   0   0   0   0
             2   3   2 163   4   5   0   4   2   7   0
             3   2   1   6 145   1   7   0   3   3   1
             4   3   1   8   1 168   3   4   5   1   3
             5   2   0   1   8   2 137   4   0   9   1
             6   2   1   1   1   4   3 156   0   0   0
             7   3   1   5   2   1   0   0 132   4   2
             8   1   1   7   3   4   2   1   0 130   5
             9   0   3   1   0   8   3   0   5   5 165
# accuracy
sum(predict_mlogit==testing$V1)/length(testing$V1)
# [1] 0.8933732

table(predict_lda$class,testing$V1)
# output
      0   1   2   3   4   5   6   7   8   9
  0 342   0   7   3   1   6   1   0   5   0
  1   0 251   2   0   4   0   0   1   0   0
  2   0   0 157   3   6   0   3   0   2   0
  3   4   2   4 142   0  16   0   2  11   0
  4   3   5  12   3 174   3   3   7   7   4
  5   1   0   2   9   0 125   3   0   4   0
  6   5   3   1   0   2   0 157   0   0   0
  7   0   0   1   1   2   0   0 129   0   5
  8   3   1  12   4   1   5   3   1 135   3
  9   1   2   0   1  10   5   0   7   2 165
# accuracy
sum(predict_lda$class==testing$V1)/length(testing$V1)
# [1] 0.8854011
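Since the question asked for a classification error rather than accuracy, note that the misclassification rate is just 1 - accuracy. A small helper (the name `misclass_rate` is my own) computes it directly from any pair of label vectors:

```r
# misclassification rate = fraction of labels that disagree
misclass_rate <- function(pred, truth) mean(pred != truth)

# toy check with made-up labels (1 of 5 predictions is wrong)
pred  <- factor(c(0, 1, 1, 2, 2))
truth <- factor(c(0, 1, 2, 2, 2))
misclass_rate(pred, truth)   # 0.2
```

This works for factors with matching levels as well as plain vectors, so it can be applied to `predict_mlogit` or `predict_lda$class` against `testing$V1`.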
Sandipan Dey
  • Hi thank you for this! Is your classification rate 1 - Accuracy then? To solve linear regression, I did `model_lm <- lm(V1 ~ ., data = training)`; `pred_lm <- predict(model_lm, testing)`; `err <- round(pred_lm) - test$V1`; `err2 <- err[err != 0]`; `classification_error_lm = length(err2)/length(test$V1)`. This gave me a classification error of 0.77... however, it does make sense that LDA and GLM are more accurate. – J. P. Feb 16 '17 at 07:06
  • you mean the mis-classification rate, I guess; yes, it's 1 - Accuracy. However, this is a classification and not a regression problem, so you should not use any regression algorithm such as linear regression; instead use a classification algorithm. – Sandipan Dey Feb 16 '17 at 07:13
  • Hi I agree, since it is a classification problem - I was just doing linear regression as a comparison to the other ones (in that the error should be pretty bad). I ended up getting a very high error but since this is a classification problem, I think that makes sense. – J. P. Feb 16 '17 at 07:21
  • One last thing if you don't mind: Getting the accuracy number using confusionMatrix. I tried installing caret using install.packages("caret", dependencies = TRUE) and then calling it: library(caret). However I end up getting this set of errors: Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : there is no package called ‘ModelMetrics’ In addition: Warning messages: 1: package ‘caret’ was built under R version 3.2.5 2: package ‘ggplot2’ was built under R version 3.2.5 Error: package or namespace load failed for ‘caret’ – J. P. Feb 16 '17 at 07:21
  • hmm, install all the dependencies manually then one by one, e.g., with `install.packages('ModelMetrics')` – Sandipan Dey Feb 16 '17 at 07:25
  • Hm I did try that and get the error: Warning in install.packages : package ‘ModelMetrics’ is not available (for R version 3.2.1) I tried checking why this might be but can't find anything. Do you by any chance have a suggestion on how to get the accuracy or the confusion matrix without using this function from library(caret)? – J. P. Feb 16 '17 at 07:31
  • you can compute accuracy without `caret` too, updated the post. – Sandipan Dey Feb 16 '17 at 07:42
  • Great thank you so much! I have accepted the answer :) – J. P. Feb 16 '17 at 07:45