I have a tab-delimited file with 70 rows of data and 34 columns of characteristics, where the first 60 rows look like this:
groups x1 x2 x3 x4 x5 (etc, up to x34)
0 0.1 0.5 0.5 0.4 0.2
1 0.2 0.3 0.8 0.4 0.1
0 0.4 0.7 0.6 0.2 0.1
1 0.4 0.4 0.7 0.1 0.4
And the last 10 rows look like this:
groups x1 x2 x3 x4 x5
NA 0.2 0.1 0.5 0.4 0.2
NA 0.2 0.1 0.8 0.4 0.1
NA 0.2 0.2 0.6 0.2 0.1
NA 0.2 0.3 0.7 0.1 0.4
The groups are binary (i.e. each row belongs to either group 0 or group 1). The aim is to use the first 60 rows as my training data set and the last 10 rows as my test data set, i.e. to classify the last 10 rows into group 0 or group 1. The class of the last 10 rows is currently "NA" because they have not been assigned to a class.
I ran this code:
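library(caret)
# Read the tab-delimited file; the first line holds the column names
data <- read.table("data_challenge_test.tab", header = TRUE)
set.seed(3303)
# Shuffled indices of the 60 labelled rows (all 60 are used for training)
train <- sample(1:60)
data.train <- data[train, ]
dim(data.train)
# Everything not in train, i.e. the 10 unlabelled rows
data.test <- data[-train, ]
dim(data.test)
# caret needs a factor outcome for classification
data.train[["groups"]] <- factor(data.train[["groups"]])
# 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = data.train, method = "knn", trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
test_pred <- predict(knn_fit, newdata = data.test)
confusionMatrix(test_pred, data.test$groups)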
The test_pred output is:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
and the confusion matrix output is:
> confusionMatrix(test_pred, data.test$groups)
Error in confusionMatrix.default(test_pred, data.test$groups) :
the data cannot have more levels than the reference
Then I checked the str of test_pred and data.test$groups:
> str(test_pred)
Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1
> str(data.test$groups)
int [1:10] NA NA NA NA NA NA NA NA NA NA
So I understand that the error occurs because my two inputs to the confusion matrix do not match: test_pred is a factor with two levels, while data.test$groups is an integer vector containing only NA.
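The mismatch can be reproduced without caret. This is a minimal base-R sketch, assuming the reference is coerced to a factor before the levels are compared (that part is an assumption, although it is consistent with the error message). The names pred and ref are made up for illustration.

```r
# Predictions copied from the test_pred output above
pred <- factor(c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0), levels = c(0, 1))

# Reference column as read from the file: integer, all NA
ref <- rep(NA_integer_, 10)

levels(pred)         # "0" "1"
levels(factor(ref))  # character(0): NA is not kept as a level by default

# Two levels on the prediction side, none on the reference side
nlevels(pred) > nlevels(factor(ref))  # TRUE
```

So the problem is less about integer versus factor and more about the reference having no known classes at all.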
So then, in my data set, I replaced the "NA" values in the groups column with 0 or 1 (i.e. I manually and arbitrarily set the first 5 unknown classes to class 0 and the second 5 unknown classes to class 1).
Then I re-ran the above code.
The output was:
> test_pred
[1] 0 0 0 0 1 1 0 1 1 0
Levels: 0 1
> confusionMatrix(test_pred, data.test$groups)
Confusion Matrix and Statistics
          Reference
Prediction 0 1
         0 4 2
         1 1 3
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.5
P-Value [Acc > NIR] : 0.1719
Kappa : 0.4
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.8000
Specificity : 0.6000
Pos Pred Value : 0.6667
Neg Pred Value : 0.7500
Prevalence : 0.5000
Detection Rate : 0.4000
Detection Prevalence : 0.6000
Balanced Accuracy : 0.7000
'Positive' Class : 0
So I have three questions:
- Originally, the classes in my training data set were all 0 or 1, and the classes in my test data set were all marked as NA or ?. caret doesn't seem to like that, given the error described above. When I assigned arbitrary binary classes to my test data set instead of NA/?, the analysis "worked" (as in, no errors). Do the binary groups I've arbitrarily assigned to the test data set affect the confusion matrix (or any other aspect of the analysis), or is this acceptable? If not, what is the solution: what group should I assign unclassified test data to at the beginning of the analysis?
- Is the test_pred output ordered? I wanted the last 10 rows of my table to be predicted, and the output of test_pred is: 0 0 0 0 1 1 0 1 1 0. Do these correspond to the last 10 rows, in order?
- I would like to visualise the results once I have this issue sorted. Can anyone recommend a standard package that is commonly used for this (I am new to machine learning)?
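For the ordering question, here is a minimal sketch on a made-up data frame (toy and toy.test are hypothetical names, not the real file). It only shows that negative indexing keeps the remaining rows in their original order with their original row names. Whether predict() returns one prediction per row of newdata in that same order is an assumption worth checking by pairing the predictions with the row names.

```r
# Toy stand-in for the real table: 60 labelled rows, then 10 unlabelled rows
toy <- data.frame(
  groups = c(rep(c(0L, 1L), 30), rep(NA_integer_, 10)),
  x1 = seq_len(70)
)

set.seed(3303)
train <- sample(1:60)      # a permutation of 1..60, as in the code above
toy.test <- toy[-train, ]  # drops rows 1..60, keeps the rest in file order

rownames(toy.test)         # original row names of the remaining rows
toy.test$x1                # values appear in the same order as in toy
```

With the real model, something like data.frame(row = rownames(data.test), predicted = test_pred) would make the pairing explicit.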
Edit: Given that the confusion matrix directly uses the reference and the prediction to calculate accuracy, I'm pretty sure I cannot just arbitrarily assign classes to the rows of unknown class, as that will affect the accuracy reported by the confusion matrix. So an alternative suggestion would be appreciated.
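One possible alternative, sketched here on synthetic data rather than the real file, is to hold out part of the 60 labelled rows for the confusion matrix and to predict the 10 unlabelled rows separately, without a confusion matrix. The data frame toy, the 80/20 split, and the names labelled, unlabelled, fit_set and holdout are all assumptions made for illustration. This is not a confirmed solution to the question.

```r
library(caret)

set.seed(3303)
# Synthetic stand-in: 5 random predictors, 60 labelled rows, 10 unlabelled rows
toy <- data.frame(x1 = runif(70), x2 = runif(70), x3 = runif(70),
                  x4 = runif(70), x5 = runif(70))
toy$groups <- c(rep(c(0L, 1L), 30), rep(NA_integer_, 10))

# Separate rows with a known class from rows without one
labelled <- toy[!is.na(toy$groups), ]
unlabelled <- toy[is.na(toy$groups), ]
labelled$groups <- factor(labelled$groups, levels = c(0, 1))

# Stratified split of the labelled rows (the 80/20 ratio is an arbitrary choice)
in_train <- createDataPartition(labelled$groups, p = 0.8, list = FALSE)
fit_set <- labelled[in_train, ]
holdout <- labelled[-in_train, ]

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knn_fit <- train(groups ~ x1 + x2 + x3 + x4 + x5, data = fit_set,
                 method = "knn", trControl = trctrl,
                 preProcess = c("center", "scale"), tuneLength = 10)

# The confusion matrix uses only rows whose true class is known
holdout_pred <- predict(knn_fit, newdata = holdout)
cm <- confusionMatrix(holdout_pred, holdout$groups)
print(cm)

# Predictions for the unlabelled rows, paired with their row names
new_pred <- predict(knn_fit, newdata = unlabelled)
result <- data.frame(row = rownames(unlabelled), predicted = new_pred)
print(result)
```

Because the predictors here are random noise, the accuracy printed by this sketch carries no meaning. The point is only that the reference passed to confusionMatrix comes from rows with a known class.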