
I have a tab delimited file with 70 rows of data and 34 columns of characteristics, where the first 60 rows look like this:

groups x1    x2     x3    x4   x5 (etc, up to x34)
0    0.1    0.5    0.5   0.4  0.2
1    0.2    0.3    0.8   0.4  0.1
0    0.4    0.7    0.6   0.2  0.1
1    0.4    0.4    0.7   0.1  0.4

And the last 10 rows look like this:

groups x1    x2     x3    x4   x5
NA    0.2    0.1    0.5   0.4  0.2
NA    0.2    0.1    0.8   0.4  0.1
NA    0.2    0.2    0.6   0.2  0.1
NA    0.2    0.3    0.7   0.1  0.4

The groups are binary (i.e. each row belongs to either group 0 or group 1). The aim is to use the first 60 rows as my training data set and the last 10 rows as my test data set, i.e. to classify the last 10 rows into group 0 or 1. The class of the last 10 rows is currently labelled as "NA" (as they have not yet been assigned to a class).

I ran this code:

library(caret)
data <- read.table("data_challenge_test.tab", header = TRUE)
set.seed(3303)
train <- sample(1:60)        # a permutation of rows 1-60
data.train <- data[train, ]  # the 60 labelled rows, shuffled
dim(data.train)
data.test <- data[-train, ]  # the remaining 10 unlabelled rows
dim(data.test)
data.train[["groups"]] <- factor(data.train[["groups"]])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = data.train,
                 method = "knn", trControl = trctrl,
                 preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = data.test)
confusionMatrix(test_pred, data.test$groups)

the test_pred output is:

> test_pred
 [1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1

and the confusion matrix output is:

> confusionMatrix(test_pred, data.test$groups)
Error in confusionMatrix.default(test_pred, data.test$groups) : 
  the data cannot have more levels than the reference

Then I checked the str of test_pred and data.test$groups:

> str(test_pred)
 Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1
> str(data.test$groups)
 int [1:10] NA NA NA NA NA NA NA NA NA NA

So I understand that my error is because my two inputs to the confusion matrix are not of the same type.
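For reference, `confusionMatrix()` expects both arguments to be factors with identical levels. A minimal sketch with toy vectors (not the question's data), assuming a reasonably recent version of caret:

```r
library(caret)

# Toy prediction/reference vectors -- both factors with the same levels
pred <- factor(c(0, 0, 1, 1), levels = c(0, 1))
ref  <- factor(c(0, 1, 1, 1), levels = c(0, 1))

cm <- confusionMatrix(pred, ref)
cm$table  # 2x2 table of predictions vs. reference
```

Converting the reference with `factor(data.test$groups, levels = c(0, 1))` would remove the type mismatch, but in this case every entry would still be NA, so the comparison would remain meaningless.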

So then in my data set, I changed the "NA" values in the groups column to either 0 or 1 (i.e. I manually assigned the first 5 unknown rows to class 0 and the last 5 unknown rows to class 1).

Then I re-ran the above code.

The output was:

> test_pred
 [1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
> confusionMatrix(test_pred, data.test$groups)
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 4 2
         1 1 3

               Accuracy : 0.7             
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 0.1719          

                  Kappa : 0.4             
 Mcnemar's Test P-Value : 1.0000          

            Sensitivity : 0.8000          
            Specificity : 0.6000          
         Pos Pred Value : 0.6667          
         Neg Pred Value : 0.7500          
             Prevalence : 0.5000          
         Detection Rate : 0.4000          
   Detection Prevalence : 0.6000          
      Balanced Accuracy : 0.7000          

       'Positive' Class : 0  

So I have three questions:

  1. Originally, the class of every row in my training data set was 0 or 1, while the classes of my test data set were all marked as NA (or ?). caret doesn't seem to like that, hence the error described above. When I assigned the test data set random binary labels instead of NA/?, the analysis "worked" (in that there were no errors). Do the randomly assigned groups affect the confusion matrix (or any other aspect of the analysis), or is this acceptable? If not, what is the solution: which group should I assign unclassified test rows to at the start of the analysis?

  2. Is the test_pred output ordered? I wanted the last 10 rows of my table to be predicted, and the output of test_pred is: 0 0 0 0 1 1 0 1 1 0. Do these predictions correspond to the last 10 rows, in order?

  3. I would like to visualise the results once this issue is sorted. Can anyone recommend a standard package that is commonly used for this (I am new to machine learning)?

Edit: Given that the confusion matrix directly uses the reference and the predictions to calculate accuracy, I'm pretty sure I cannot just randomly assign classes to the unknown rows, as that will affect the accuracy reported by the confusion matrix. So an alternative suggestion would be appreciated.
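One way to get an honest confusion matrix is to hold out some of the 60 labelled rows as a validation set, instead of inventing labels for the unlabelled rows. A sketch along the lines of the code above, but with a simulated stand-in data frame (with the real file, replace `labelled` with `data[1:60, ]`):

```r
library(caret)

# Simulated stand-in for the 60 labelled rows with 5 predictors
set.seed(3303)
labelled <- data.frame(groups = factor(rep(c(0, 1), 30)),
                       matrix(runif(60 * 5), ncol = 5,
                              dimnames = list(NULL, paste0("x", 1:5))))

# Hold out 10 labelled rows as a validation set
val_idx   <- sample(1:60, 10)
train_set <- labelled[-val_idx, ]
val_set   <- labelled[val_idx, ]

trctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~ ., data = train_set, method = "knn",
                 trControl = trctrl, preProcess = c("center", "scale"),
                 tuneLength = 10)

val_pred <- predict(knn_fit, newdata = val_set)
confusionMatrix(val_pred, val_set$groups)  # valid: true labels are known
```

The model trained this way can then be applied to the 10 genuinely unlabelled rows, whose predictions are simply reported rather than scored.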

jmuhlenkamp
Slowat_Kela

1 Answer

  1. A confusion matrix compares your classification output to the actual classes. So if your test data set does not have labels, you cannot draw a confusion matrix. There are other ways of checking how well your classification algorithm did; for now you can read about AIC, which plays a role analogous to linear regression's R-squared. If you still want a confusion matrix, use the first 50 rows for training and rows 51-60 for testing. That output will let you create a confusion matrix.
  2. Yes, the output is ordered, and you can column-bind it to your test set.
  3. Classification results are commonly visualised by drawing a ROC curve. The caret library should have that too.
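To expand on points 2 and 3: `predict()` returns one value per row of `newdata`, in the same row order, so the predictions can be column-bound to the test rows with something like `cbind(data.test, predicted = test_pred)`. A ROC curve needs predicted class probabilities *and* known labels, so it can only be drawn on labelled data. A toy sketch using the pROC package (an assumed choice of package; with a caret model, probabilities would come from `predict(knn_fit, newdata = ..., type = "prob")`):

```r
library(pROC)

# Simulated known labels and predicted probabilities for class "1"
set.seed(1)
known_labels <- factor(rep(c(0, 1), each = 20))
prob_class1  <- ifelse(known_labels == 1, runif(40, 0.3, 1), runif(40, 0, 0.7))

roc_obj <- roc(response = known_labels, predictor = prob_class1)
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # draws the curve
```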
rogerthat
  • Thank you, yes I had just edited my question with your answer to 1. Your answers make sense and I appreciate it. Just one thing to clarify: is the actual method OK? i.e. is it acceptable to randomly assign unclassed rows to 0 or 1 and then do the prediction, rather than leaving them as ? or NA. – Slowat_Kela Aug 04 '17 at 15:01