
I'm struggling to create a confusionMatrix in R using the code below.

I've googled the issue and found that it seems to be related to a divergence between the training and test sets, but I have absolutely no clue what's causing this difference.

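# Packages used below (caret for modelling, doParallel for the cluster)
library(caret)
library(doParallel)

# Create DB with Topics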
doctopicDB <- data.frame(doc_topic_distr)
doctopicDB <- cbind(doc_id = rownames(doctopicDB), doctopicDB)
doctopicDB$doc_id <- as.numeric(as.character(doctopicDB$doc_id))  

# Merge Topic Distr with rawDB
predictionDB <- merge(x=rawDBnoDups, 
                      y=doctopicDB, 
                      by="doc_id",
                      all=TRUE)

# Create SVM Subset: Helpful + All Topic Distributions
predictionDB <- subset(predictionDB, 
                       select = c(6,12:(11+nTopicsLDA)))

# Binarize all topic columns (1 if > 0, else 0)
predictionDB <- data.frame(predictionDB[1], 
                           (predictionDB[-1] > 0) * 1)

# Get rid of potentially NAs
predictionDB <- na.omit(predictionDB) # get rid of NA rows
testForNA <- predictionDB[rowSums(is.na(predictionDB)) > 0,] #Debugging

# Debug:
# Split a 10% Testset for Debugging purpose
#trainData <- createDataPartition(y = predictionDB$Helpful, p= 0.9, list = FALSE)      ########################## Debug
#trainData <- predictionDB[-trainData,]

# Split DB in Train & Test Set ( 70/30%)
trainData <- createDataPartition(y = predictionDB$Helpful, p= 0.7, list = FALSE)
trainSet <- predictionDB[trainData,]
testSet <- predictionDB[-trainData,]

# Factorizing the target variable 
trainSet[["Helpful"]] = factor(trainSet[["Helpful"]])

# Setup the training method
# - method: repeated cross-validation
# - number: number of resample iterations
# - repeats: set to compute the repeated cross-validation

trctrl <- trainControl(method = "repeatedcv", 
                       number = 10, 
                       repeats = 3)

# Parallelize SVM code
cl <- makeCluster(detectCores()-1)        # save 1 core as spare
registerDoParallel(cl)

# machine learning code goes in here
svm_Linear <- train(Helpful ~ ., 
                    data = trainSet, 
                    method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", 
                                   "scale"),
                    tuneLength = 10)
stopCluster(cl)

Output:

> stopCluster(cl)
> 
> test_pred <- predict(svm_Linear, newdata = testSet)
> confusionMatrix(table(test_pred, testSet$Helpful))
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type

A hint would be totally awesome!

Many thanks in advance!

//edit:

Input:

dput(table(test_pred, testSet$Helpful))

Output:

...
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 60L, 
0L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L), .Dim = c(259L, 89L), .Dimnames = list(test_pred = c("0", 
"000", "1", "10", "100", "101", "102", "103", "104", "105", "106", 
"107", "108", "109", "11", "110", "111", "112", "113", "115", 
"116", "117", "118", "119", "12", "120", "122", "123", "124", 
"125", "126", "127", "129", "13", "130", "132", "133", "135", 
"136", "137", "138", "139", "14", "141", "142", "144", "145", 
"147", "148", "149", "15", "150", "151", "155", "156", "157", 
"158", "159", "16", "161", "162", "163", "164", "166", "168", 
"169", "17", "173", "174", "176", "179", "18", "181", "184", 
"188", "19", "190", "191", "192", "197", "198", "2", "20", "200", 
"205", "21", "212", "213", "215", "217", "219", "22", "221", 
"223", "224", "226", "228", "229", "23", "231", "232", "237", 
"24", "241", "242", "243", "245", "247", "25", "251", "254", 
"258", "259", "26", "261", "262", "265", "269", "27", "271", 
"274", "277", "279", "28", "285", "287", "29", "294", "295", 
"297", "3", "30", "301", "306", "31", "310", "313", "315", "316", 
"32", "323", "325", "328", "33", "334", "34", "340", "344", "347", 
"348", "35", "353", "36", "361", "37", "38", "382", "383", "39", 
"393", "396", "4", "40", "405", "408", "41", "410", "415", "42", 
"43", "431", "438", "44", "440", "443", "448", "45", "459", "46", 
"47", "474", "48", "482", "485", "49", "492", "5", "50", "51", 
"52", "53", "530", "538", "54", "546", "549", "55", "553", "56", 
"563", "57", "570", "571", "58", "59", "6", "60", "61", "62", 
"624", "63", "64", "65", "66", "661", "663", "667", "67", "68", 
"69", "7", "70", "706", "71", "72", "723", "73", "738", "74", 
"75", "76", "77", "78", "79", "8", "80", "81", "813", "817", 
"82", "825", "83", "84", "85", "86", "88", "89", "9", "90", "91", 
"92", "93", "94", "942", "95", "96", "965", "97", "98"), c("0", 
"1", "10", "103", "11", "116", "117", "12", "122", "13", "137", 
"14", "15", "155", "157", "158", "16", "17", "18", "19", "2", 
"20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "3", 
"30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "4", 
"40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "5", 
"50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "6", 
"60", "61", "62", "63", "65", "66", "67", "69", "7", "70", "71", 
"72", "75", "77", "79", "8", "80", "81", "82", "83", "84", "85", 
"9", "90")), class = "table")
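
The `.Dim = c(259L, 89L)` entry above shows that the table is not square: `test_pred` carries 259 levels, while `testSet$Helpful` has only 89 distinct values. That is a plausible trigger for the error, because `all.equal()` returns a character string rather than `FALSE` when its arguments differ, and `!` cannot be applied to a string. A minimal sketch of aligning the levels, using made-up toy vectors (`pred` and `truth` are hypothetical stand-ins for `test_pred` and `testSet$Helpful`):

```r
# Toy stand-ins for test_pred and testSet$Helpful (hypothetical values).
pred  <- factor(c("0", "1", "2", "0", "5"))
truth <- factor(c("0", "1", "1", "0", "0"))

# Different level sets produce a non-square table.
raw_tab <- table(pred, truth)
dim(raw_tab)

# Rebuild both factors on the union of their levels.
all_levels    <- union(levels(pred), levels(truth))
pred_aligned  <- factor(pred,  levels = all_levels)
truth_aligned <- factor(truth, levels = all_levels)

# The table now has one row and one column per level.
square_tab <- table(pred_aligned, truth_aligned)
dim(square_tab)

# With caret loaded, confusionMatrix(square_tab) should then accept the table.
```

This only removes the dimension mismatch; it does not address why the outcome has so many levels in the first place.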
  • Please, can you share the output of `dput(table(test_pred, testSet$Helpful))`? – Marco Sandri Aug 31 '19 at 17:07
  • Maybe it's the same problem as [here](https://stackoverflow.com/questions/19871043/r-package-caret-confusionmatrix-with-missing-categories) – Stéphane Laurent Aug 31 '19 at 17:34
  • Hey there, @MarcoSandri, I've attached the output of `dput(table(test_pred, testSet$Helpful))`. @StéphaneLaurent yes, I've found that one and it seems to work. I got an accuracy of about 45%, which seems pretty bad. Is there a way to improve the model? :/ The raw data came out of an LDA topic model. Thank you very much for taking the time to read this thread. – Flocke Haus Sep 01 '19 at 08:48
  • @FlockeHaus Are you sure that you pasted the full output? Why the `...` at the beginning? – Marco Sandri Sep 01 '19 at 08:53
  • @FlockeHaus Anyway, the output `test_pred` of your model is a factor with many levels, and this is very strange. What is the output of `str(trainSet$Helpful)`? – Marco Sandri Sep 01 '19 at 08:56
  • @MarcoSandri It took a while since I tried to calculate another model, but here we go: `> str(trainSet$Helpful) Factor w/ 300 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...` And no, there were many more `0L,` entries, but it seems RStudio wasn't even able to print them all :O – Flocke Haus Sep 01 '19 at 15:49
  • Are you sure that you want to make predictions on a categorical variable Y with 300 categories? Or should that variable be numerical? – Marco Sandri Sep 01 '19 at 15:54
  • Mmh no, Y should usually be a number between 0 and approx. 50 (?). It's something like the number of likes related to the specific doc. What makes the system think there are about 300 categories? :O – Flocke Haus Sep 01 '19 at 16:09
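
On the last question: the 300 categories most likely come from the `factor(trainSet[["Helpful"]])` call in the question, since `factor()` creates one level for every distinct value in the column. A small sketch with invented like counts (the `helpful` vector is hypothetical, not taken from the real data):

```r
# Hypothetical like counts standing in for the Helpful column.
helpful <- c(0, 3, 12, 0, 47, 3)

# factor() turns every distinct count into its own class.
helpful_factor <- factor(helpful)
nlevels(helpful_factor)

# Recover the original numbers by going through character first.
recovered <- as.numeric(as.character(helpful_factor))

# as.numeric() directly on the factor returns the internal level codes instead.
level_codes <- as.numeric(helpful_factor)
```

If `Helpful` is left numeric, caret's `train()` should treat the task as regression rather than classification, in which case a confusion matrix no longer applies and an error measure such as RMSE would be the usual choice. Another option, assuming a yes/no target is acceptable, is to binarize the count before factoring it.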
