
I ran an SVM classification using the `e1071` package. The goal is to predict `type` from all other variables in `dtm`.

 dtm[140:145] %>% str()
 'data.frame':  385 obs. of  6 variables:
 $ think   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ actually: num  0 0 0 0 0 0 0 0 0 0 ...
 $ comes   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ able    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hours   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ type    : Factor w/ 4 levels "-1","0","1","9": 4 3 3 3 4 1 4 4 4 3 ...

To train and test the model, I used 10-fold cross-validation.

library(e1071)
model <- svm(type ~ ., dtm, cross = 10, gamma = 0.5, cost = 1)
summary(model)

Call:
svm(formula = type ~ ., data = dtm, cross = 10, gamma = 0.5, cost = 1)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
     gamma:  0.5 

Number of Support Vectors:  385

 ( 193 134 41 17 )


Number of Classes:  4 

Levels: 
 -1 0 1 9

10-fold cross-validation on training data:

Total Accuracy: 50.12987 
Single Accuracies:
 52.63158 51.28205 52.63158 43.58974 60.52632 43.58974 57.89474 48.71795 
 39.47368 51.28205 

My question is: how can I generate a confusion matrix for these results? Which components of `model` do I have to pass to `table()` or `confusionMatrix()` to get the matrix?

Banjo
  • `model` doesn't have columns; you need to generate predictions on your test data and then use `table()`, see [here](https://stackoverflow.com/questions/40080794/calculating-prediction-accuracy-of-a-tree-using-rparts-predict-method-r-progra) or [use confusionMatrix() from caret](https://stackoverflow.com/a/41443133/4964651) – mtoto Jan 22 '18 at 10:37
  • I don't have enough data. That's the reason I used that method. 10-fold CV runs the test/train procedure 10 times with a test/train ratio of 0.1/0.9 on different parts of the dataset, so the test data must already be in the model?! If I can't get the confusion matrix, is there a way to calculate accuracy with a CI, recall, and precision? – Banjo Jan 22 '18 at 10:57
  • what package are u using? – mtoto Jan 22 '18 at 11:02
  • `e1071`, but I could also use `caret`. I like the tune function in `e1071` – Banjo Jan 22 '18 at 11:04

2 Answers


As far as I know, there is no way to access the per-fold predictions in `e1071` when doing cross-validation.

One easy way to do it:

some data:

library(e1071)
library(mlbench)
data(Sonar)

generate the folds:

k <- 10
folds <- sample(rep(1:k, length.out = nrow(Sonar)), nrow(Sonar))

run the models:

z <- lapply(1:k, function(x){
  # train on every fold except x, then predict the held-out fold
  model <- svm(Class ~ ., Sonar[folds != x, ], gamma = 0.5, cost = 1, probability = TRUE)
  pred <- predict(model, Sonar[folds == x, ])
  true <- Sonar$Class[folds == x]
  return(data.frame(pred = pred, true = true))
})

to generate a confusion matrix for all left-out samples:

z1 <- do.call(rbind, z)
caret::confusionMatrix(z1$pred, z1$true)

to generate one for each fold:

lapply(z, function(x){
  caret::confusionMatrix(x$pred, x$true)
})

for reproducibility, set the seed prior to the fold creation.
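For example, a minimal sketch reusing the `Sonar` setup above (the seed value 42 is an arbitrary choice):

```r
library(mlbench)
data(Sonar)

set.seed(42)  # fix the RNG state so the fold assignment is reproducible
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Sonar)), nrow(Sonar))
table(folds)  # fold sizes are as equal as possible
```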

In general, if you do this sort of thing often, choose a higher-level library such as mlr or caret.
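For instance, `caret` can save the held-out predictions of each fold automatically. A sketch, assuming the kernlab package is installed (it backs `method = "svmRadial"`):

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 10,
                     savePredictions = "final")  # keep the out-of-fold predictions
fit <- train(Class ~ ., data = Sonar,
             method = "svmRadial",  # radial-kernel SVM via kernlab
             trControl = ctrl)

# fit$pred holds the held-out predictions across all 10 folds
confusionMatrix(fit$pred$pred, fit$pred$obs)
```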

missuse

Suppose you want to create a confusion matrix of predictions versus the real values from a dataset called `dtm`, where your target variable is called `type`. First, predict the values with the model:

prediction <- predict(model, dtm)

Then you can create the confusion matrix with the code:

library(caret)
confusionMatrix(prediction, dtm$type, dnn = c("Prediction", "Reference"))

Hope it's clear enough.
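As a follow-up to the comments asking about accuracy with a CI, recall, and precision: the object returned by `confusionMatrix()` exposes these directly. A minimal sketch using `iris` as stand-in data (the asker's `dtm` is not available here; the `Precision`/`Recall` column names assume a recent caret version, and these are in-sample numbers, so the overfitting caveat from the comments applies):

```r
library(caret)
library(e1071)

data(iris)
fit <- svm(Species ~ ., iris)
cm <- confusionMatrix(predict(fit, iris), iris$Species)

cm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]  # accuracy with its 95% CI
cm$byClass[, c("Precision", "Recall")]                       # one row per class
```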

  • When I use `prediction <- predict(model, dtm)` and `table(prediction, dtm$type)` the results are not valid, because the model is predicting values it already knows. So I get an accuracy near 1. – Banjo Jan 22 '18 at 10:51
  • 1
    Well, I don't think this is a problem of code but more conceptual. These commands create exactly the predictions that your model makes out of the same dataset you used for training. If accuracy is close to 1, it can be a problem of overfitting or unbalanced classes. – Davide Bottoli Jan 22 '18 at 10:58
  • @Davide Bottoli For the model above I get an accuracy of 0.6. If I do the prediction I get 0.92 to 0.98. But you are right, the classes are imbalanced. If I use the approach from @missuse it's working. But your comment makes me doubt that I understand the procedure completely. – Banjo Jan 22 '18 at 11:15
  • @Davide Bottoli Is it allowed/valid to evaluate the model through `prediction <- predict(model, dtm)` and `table(prediction, dtm$type)`? – Banjo Jan 22 '18 at 11:21
  • It's allowed. I mean, that's actually what your model is predicting. If you want to go deeper into the model, I suggest you check the other answer. – Davide Bottoli Jan 22 '18 at 13:35
  • The goal of the classification is an "automated quantitative content analysis". The sample is representative regarding size and content. However, I don't think the results would be the same if I used the model to predict another randomly selected sample of that kind. Thanks for the input, very helpful! – Banjo Jan 22 '18 at 15:59