
I am reviewing my e1071 code for SVM for the Kaggle Titanic data. Last I knew, this part of it was working, but now I'm getting a rather strange error. When I try to build my data.frame so I can submit to kaggle, it seems my prediction is the size of my training set instead of the test set.

Problem

Error in data.frame(PassengerId = test$passengerid, Survived = prediction) : arguments imply differing number of rows: 418, 714

Obviously, they should both be 418, and I do not understand what is going wrong.

Details

Here is my script:

setwd("Path\\To\\Data")
train <- read.csv("train.csv")
test <- read.csv("test.csv")

library("e1071")
bestModel = svm(Survived ~ Pclass + Sex + Age + Sex * Pclass, data = train, kernel = "linear", cost = 1)

prediction <- predict(bestModel, newData=test, type="response")
prediction[prediction >= 0.5] <- 1
prediction[prediction != 1] <- 0
prediction[is.na(prediction)] <- 0

This is the line that gives me the error:

predictionSubmit <- data.frame(PassengerId = test$passengerid, Survived = prediction)
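
For reference, this is the standard error `data.frame()` throws whenever its column arguments have lengths that cannot be recycled; a minimal illustration:

data.frame(a = 1:3, b = 1:2)
# Error in data.frame(a = 1:3, b = 1:2) :
#   arguments imply differing number of rows: 3, 2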

Attempts

I have used names(train) and names(test) to verify my column variable names are the same. You can find the data here. I know my prediction code can be optimized into one line, but that isn't the issue here. I would appreciate a second pair of eyes on this issue. I am thinking about using the kernlab library, but was wondering whether there was a syntactical sugar issue I was neglecting here. Thanks for your suggestions and clues.
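
Before building the data frame, a quick sanity check on the lengths narrows this down (a minimal check, assuming train, test, and prediction exist as in the script above):

nrow(test)               # expected 418
nrow(train)              # 891 rows in the Kaggle training file
length(prediction)       # 714 here: the training rows left after NA removal, not 418
head(names(prediction))  # the names show which rows the values came from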

hlyates
  • There do not appear to be any errors in the syntax. Go through the standard troubleshooting steps. Remove the objects and start again with `rm(prediction, train, test, bestModel)`, then run the script again. Before creating the data frame, check `nrow(test)` and `length(prediction)`. – Pierre L Jul 23 '16 at 22:38
  • bestModel produces a prediction sized to the training set rather than the test set. The question is, why? The problem is with the `svm()` call that builds `bestModel`. No syntax errors, but still not working. It's weird. – hlyates Jul 23 '16 at 23:05
  • Use `prediction[na.omit(names(prediction))]`. I will try to find out why this is necessary. – Pierre L Jul 23 '16 at 23:13
  • I'm still getting the '418, 714' difference. Where are you using this in your code to get it to work? – hlyates Jul 23 '16 at 23:18
  • Okay, I reran this in the morning and cleared out my memory cache and everything was working. Weird, right? – hlyates Jul 24 '16 at 16:32

1 Answer

library("e1071")

#10 items in training set
y <- sample(0:1, 10, TRUE)
x <- rnorm(10)
bestModel <- svm(y ~ x, kernel = "linear", cost = 1)

#Six in test set
prediction <- predict(bestModel, newdata=rnorm(6), type="response")

#Output has 10 values (unexpected)
prediction
#           1          2          3          4          5          6       <NA>       <NA> 
#  0.05163974 0.58048905 0.49524846 0.13524885 0.12592718 0.06082822 0.55393256 1.08488424 
#        <NA>       <NA> 
#  0.94836026 0.47679646 

#For correct output, remove names with <NA>
prediction[na.omit(names(prediction))]
#         1          2          3          4          5          6 
#0.05163974 0.58048905 0.49524846 0.13524885 0.12592718 0.06082822 
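
For comparison, passing `newdata` as a data.frame that contains the predictor column used in the fit returns one value per test row (a minimal sketch reusing the toy data above, not the question's data). A likely related detail in the question's script: `predict()`'s argument is spelled `newdata` in lower case, so `newData = test` falls into `...` and is silently ignored, which would explain why the 714 fitted training values come back.

library("e1071")
set.seed(42)

#Same toy setup as above
y <- sample(0:1, 10, TRUE)
x <- rnorm(10)
bestModel <- svm(y ~ x, kernel = "linear", cost = 1)

#newdata (lower case) supplied as a data.frame with the fitted predictor
newx <- data.frame(x = rnorm(6))
prediction <- predict(bestModel, newdata = newx)
length(prediction)
#[1] 6
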
Pierre L
  • Can you please duplicate my issue exactly? The code is open source and all you have to do is run the script. I have noticed that in my case there are no `<NA>` entries in the names; rather, the names are the ID range of the training set instead of the prediction set. Is this a bug? – hlyates Jul 24 '16 at 00:48
  • That being said, thank you so much for helping out. I do indeed see the `<NA>` names just like you do in this instance, but that doesn't seem to be the cause with the .csv data. :( – hlyates Jul 24 '16 at 00:49