I am trying to use SVM for a multi-class classification task.

I have a dataset called df, which I divided into a training and a test set with the following code:

library(dplyr) # provides %>% and arrange()
sample <- df[sample(nrow(df), 10000),] # take a random sample of 10,000 rows from dataset df
sample <- sample %>% arrange(Date) # arrange chronologically
train <- sample[1:8000,] # first 80% of the sample
test <- sample[8001:10000,] # remaining 20% of the sample

This is what the training set looks like:

> str(train)
'data.frame':   8000 obs. of  45 variables:
 $ Date            : Date, format: "2008-01-01" "2008-01-01" "2008-01-02" ...
 $ Weekday         : chr  "Tuesday" "Tuesday" "Wednesday" "Wednesday" ...
 $ Season          : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Weekend         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Icao.type       : Factor w/ 306 levels "A124","A225",..: 7 29 112 115 107 10 115 115 115 112 ...
 $ Act.description : Factor w/ 389 levels "A300-600F","A330-200F",..: 9 29 161 162 150 13 162 162 162 161 ...
 $ Arr.dep         : Factor w/ 2 levels "A","D": 2 2 1 1 1 1 1 1 1 1 ...
 $ MTOW            : num  77 69 46 21 22 238 21 21 21 46 ...
 $ Icao.wtc        : chr  "Medium" "Medium" "Medium" "Medium" ...
 $ Wind.direc      : int  104 104 82 82 93 93 93 132 132 132 ...
 $ Wind.speed.vec  : int  35 35 57 57 64 64 64 62 62 62 ...
 $ Wind.speed.daily: int  35 35 58 58 65 65 65 63 63 63 ...
 $ Wind.speed.max  : int  60 60 70 70 80 80 80 90 90 90 ...
 $ Wind.speed.min  : int  20 20 40 40 50 50 50 50 50 50 ...
 $ Wind.gust.max   : int  100 100 120 120 130 130 130 140 140 140 ...
 $ Temp.daily      : int  24 24 -5 -5 4 4 4 34 34 34 ...
 $ Temp.min        : int  -7 -7 -25 -25 -13 -13 -13 11 11 11 ...
 $ Temp.max        : int  50 50 16 16 13 13 13 55 55 55 ...
 $ Temp.10.min     : int  -11 -11 -32 -32 -18 -18 -18 9 9 9 ...
 $ Sun.dur         : int  7 7 65 65 19 19 19 0 0 0 ...
 $ Sun.dur.prct    : int  9 9 83 83 24 24 24 0 0 0 ...
 $ Radiation       : int  173 173 390 390 213 213 213 108 108 108 ...
 $ Precip.dur      : int  0 0 0 0 0 0 0 5 5 5 ...
 $ Precip.daily    : int  0 0 0 0 -1 -1 -1 2 2 2 ...
 $ Precip.max      : int  0 0 0 0 -1 -1 -1 2 2 2 ...
 $ Sea.press.daily : int  10259 10259 10206 10206 10080 10080 10080 10063 10063 10063 ...
 $ Sea.press.max   : int  10276 10276 10248 10248 10132 10132 10132 10086 10086 10086 ...
 $ Sea.press.min   : int  10250 10250 10141 10141 10058 10058 10058 10001 10001 10001 ...
 $ Visibility.min  : int  1 1 40 40 43 43 43 58 58 58 ...
 $ Visibility.max  : int  59 59 75 75 66 66 66 65 65 65 ...
 $ Cloud.daily     : int  7 7 3 3 8 8 8 8 8 8 ...
 $ Humidity.daily  : int  96 96 86 86 77 77 77 82 82 82 ...
 $ Humidity.max    : int  99 99 92 92 92 92 92 90 90 90 ...
 $ Humidity.min    : int  91 91 74 74 71 71 71 76 76 76 ...
 $ Evapo           : int  2 2 4 4 2 2 2 1 1 1 ...
 $ Wind.discrete   : chr  "South East" "South East" "North East" "North East" ...
 $ Vmc.imc         : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ Beaufort        : num  3 3 4 4 4 4 4 4 4 4 ...
 $ Main.A          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Main.B          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Main.K          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Main.O          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Main.P          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Main.Z          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Runway          : Factor w/ 13 levels "04","06","09",..: 3 8 2 2 2 6 2 6 6 6 ...

Then, I try to tune the SVM parameters with the following code:

library(e1071)
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))

While this code has worked in the past, it now gives me the following error:

Error in newdata[, object$scaled, drop = FALSE] : 
  (subscript) logical subscript too long

The only thing I can think of that has changed is which rows end up in train: every run of the first code block draws a new random sample of 10,000 rows from df, which contains 3.5 million rows.

Does anyone know why I am getting this error?

kiae

2 Answers

I recognise that this question was rather hard to solve without a good reproducible example.

However, I have found the solution to my problem and wanted to post it here for anyone who might be looking for this in the future.

Running the same code, but with selected columns from the train set:

tuned <- tune.svm(Runway ~ ., data = train[,c(1:2, 45)], gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))

gave me absolutely no problem. I kept adding features until the error reappeared. I found that the features Vmc.imc and Icao.wtc were causing the error, and that both were chr features. Using the following code:

train$Vmc.imc <- as.factor(train$Vmc.imc)
train$Icao.wtc <- as.factor(train$Icao.wtc)

to change them into factors and then rerunning

tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))

solved my problem.
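Rather than converting the chr columns one by one, they can all be converted in a single step. The following is a minimal sketch using base R only. The data frame train_demo and its values are a hypothetical stand-in for train, with a few column names borrowed from the str output above.

```r
# Hypothetical stand-in for `train`; values are illustrative only.
train_demo <- data.frame(
  Weekday  = c("Tuesday", "Wednesday", "Tuesday"),
  Icao.wtc = c("Medium", "Heavy", "Medium"),
  MTOW     = c(77, 238, 46),
  stringsAsFactors = FALSE
)

# Flag every character column, then convert only those columns to factors.
chr_cols <- vapply(train_demo, is.character, logical(1))
train_demo[chr_cols] <- lapply(train_demo[chr_cols], as.factor)

str(train_demo)
```

Numeric columns such as MTOW are left untouched, so the same two lines should be safe to run on the full train data frame before calling tune.svm.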

I do not know why the other chr features such as Weekday and Wind.discrete are not causing the same issue. If anyone knows the answer to this, I would be glad to find out.
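One possible explanation, which is an assumption and not confirmed anywhere in this thread: a chr column is expanded into dummy columns based only on the values present in the rows at hand. If a rare value is missing from one cross-validation fold, that fold gets fewer columns than the data the model was fitted on, which would be consistent with a "subscript too long" error. Columns such as Weekday, whose values all occur frequently, would then rarely trigger it. The sketch below illustrates the column-count mismatch with base R's model.matrix; the data frame demo and its values are hypothetical.

```r
# Hypothetical data: "Light" is a rare value that occurs only in row 4.
demo <- data.frame(
  y   = c(1, 2, 3, 4),
  wtc = c("Medium", "Heavy", "Medium", "Light"),
  stringsAsFactors = FALSE
)

# As a character column, the design matrix depends on the values present.
fold <- demo[1:3, ]                                   # "Light" is absent here
cols_full_chr <- ncol(model.matrix(y ~ wtc, demo))    # all three values seen
cols_fold_chr <- ncol(model.matrix(y ~ wtc, fold))    # one dummy column fewer

# As a factor, the subset keeps all levels, so the column count is stable.
demo$wtc <- factor(demo$wtc)
fold <- demo[1:3, ]
cols_fold_fct <- ncol(model.matrix(y ~ wtc, fold))

print(c(full_chr = cols_full_chr, fold_chr = cols_fold_chr, fold_fct = cols_fold_fct))
```

If this is indeed the cause, it would also explain why the error came and went between runs: each new random sample can place the rare values in different folds.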

kiae

This is similar to this thread here. I added the fact that if you neglect to convert all your character features to factors, you will also receive this error when you run predict.
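For predict, it may help to convert the character columns in the test set using the factor levels from the training set, so that both data frames describe the same categories. This is a minimal sketch with base R; train_demo, test_demo and the values "VMC" and "IMC" are hypothetical and only illustrate the idea.

```r
# Hypothetical stand-ins for `train` and `test`; values are illustrative.
train_demo <- data.frame(Vmc.imc = c("Unknown", "VMC", "IMC", "VMC"),
                         stringsAsFactors = FALSE)
test_demo  <- data.frame(Vmc.imc = c("Unknown", "VMC"),
                         stringsAsFactors = FALSE)

# Convert the training column first, then reuse its levels for the test column.
train_demo$Vmc.imc <- factor(train_demo$Vmc.imc)
test_demo$Vmc.imc  <- factor(test_demo$Vmc.imc,
                             levels = levels(train_demo$Vmc.imc))

# Both columns now share the same set of levels, even though "IMC"
# does not occur in the test rows.
levels(test_demo$Vmc.imc)
```

Note that a value appearing only in the test set would become NA under this approach, so it is worth checking for NAs after the conversion.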

rayphaistos1