
For "small" data set it works just fine. However, for one of my biggest sets (n=498.706) the model give me the error ( see bottom ).

Any ideas what may cause this issue?

Code Snippet

# Load the required packages
library(caret)        # createDataPartition, train, confusionMatrix
library(doParallel)   # registerDoParallel; also attaches parallel (makeCluster, detectCores)

# Binarize just the target variable
predictionDB$Helpful[predictionDB$Helpful > 0] <- 1

# Binarize all predictors instead (alternative, currently disabled)
#predictionDB <- data.frame(predictionDB[1], (predictionDB[-1] > 0) * 1)


# Drop rows containing NAs
predictionDB <- na.omit(predictionDB)


# Split the data into train & test sets (70/30)
trainData <- createDataPartition(y = predictionDB$Helpful, p= 0.7, list = FALSE)
trainSet <- predictionDB[trainData,]
testSet <- predictionDB[-trainData,]

# Factor-encode the target variable
trainSet[["Helpful"]] <- factor(trainSet[["Helpful"]])

# 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", 
                       number = 10, 
                       repeats = 3,
                       allowParallel = TRUE)

# Parallelize the SVM code
cl <- makeCluster(detectCores() - 1)      # keep 1 core as spare
registerDoParallel(cl)

# Train the linear SVM (caret pulls in kernlab for method = "svmLinear")
# Note: caret's default grid for "svmLinear" fixes C = 1, so tuneLength
# only takes effect if a tuneGrid is supplied instead.
svm_Linear <- train(Helpful ~ ., 
                    data = trainSet, 
                    method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
stopCluster(cl)

# Predict on the hold-out set and evaluate
test_pred <- predict(svm_Linear, newdata = testSet)

confusionMatrix(table(test_pred, testSet$Helpful))
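
As a side note on data this size: caret's formula interface expands the predictors through model.matrix() before fitting, so the object handed to the learner can be much larger than the original data frame. A quick base-R sketch (my addition, not part of the original snippet) to gauge that size up front:

# Estimate the expanded design matrix that the formula interface builds
# internally (an approximation of what train() will materialize).
X <- model.matrix(Helpful ~ ., data = trainSet)
print(object.size(X), units = "Gb")
rm(X)   # free it again before training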

Warnings

Warning in .Internal(gc(verbose, reset, full)) :
  closing unused connection 34 (<-localhost:11963)


Error:

There were missing values in resampled performance measures.
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1    
Error: Stopping
  • It's going to be difficult to give good advice on this without actually working with the data, unfortunately. Your last "error" text that shows two columns of summary data ... the columns are consistent with a single row of data with `NA` (try `summary(NA_real_)`); after what part of your code does this occur? Are you certain that your data has sufficient rows when you partition it? – r2evans Sep 03 '19 at 23:35
  • Hey @r2evans, thanks for your reply. The error appears when running the `svm_Linear` line. This is what `summary(NA_real_)` gives me: `Min. 1st Qu. Median Mean 3rd Qu. Max. NA's / NA NA NA NaN NA NA 1` – Flocke Haus Sep 04 '19 at 04:32
  • Are you certain that your data (`trainSet`) has sufficient rows when you call `train`? – r2evans Sep 04 '19 at 04:34
  • @r2evans yes, it contains 348,421 rows. `testForNA <- testSet[rowSums(is.na(predictionDB)) > 0,]` gives me a data frame with 0 entries. :/ – Flocke Haus Sep 04 '19 at 05:35
  • I did some tests and found that whenever I try to use a data frame of more than 100,000 samples, the error above occurs. Is there some kind of technical limit to this implementation of SVM? – Flocke Haus Sep 04 '19 at 19:57

1 Answer


I think I've found the error, but I'll have to look into it further.

outfile = "Log.txt" to cl <- makeCluster(detectCores()-2, outfile = "Log.txt")

gave me the following hint:

Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
In addition: Warning messages:
1: model fit failed for Fold04.Rep2: C=1 Error : cannot allocate vector of size 23.3 Gb

2: model fit failed for Fold10.Rep3: C=1 Error : cannot allocate vector of size 23.3 Gb

So I'll try to add some additional RAM to my VM.
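
For context on the 23.3 Gb figure, a back-of-the-envelope sketch (my assumption: the failing allocation is a square double-precision matrix; R reports allocation sizes in 1024-based units):

# A double takes 8 bytes, so an n x n matrix needs n^2 * 8 bytes.
# Solving for the n that matches the failed 23.3 Gb allocation:
bytes <- 23.3 * 1024^3
sqrt(bytes / 8)   # ~55,900 rows/columns

# Note: with doParallel each worker is a separate R process, so peak
# memory is roughly (number of workers) x (per-fit allocation).
# Fewer workers trade speed for memory, e.g.:
# cl <- makeCluster(2, outfile = "Log.txt")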

Just in case someone else runs into similar issues: I hope this helps.
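
If adding RAM is not enough, a lighter-weight alternative may be worth trying (a sketch, assuming the LiblineaR package is installed: caret's "svmLinear3" method wraps LiblineaR, a primal linear-SVM solver that avoids kernlab's kernel machinery and is usually far less memory-hungry on large data):

# Same pipeline as above, but with the LiblineaR-backed linear SVM.
svm_liblinear <- train(Helpful ~ .,
                       data = trainSet,
                       method = "svmLinear3",   # requires LiblineaR
                       trControl = trctrl,
                       preProcess = c("center", "scale"),
                       tuneLength = 5)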