
I am using sklearn's SVC on a pandas DataFrame to predict categorical data. The feature matrix, named "feature_train", consists of a single time column (numpy.int64) and a few thousand tf-idf columns (which are very sparsely populated with numpy.float64 values):

     Timestamp Start  able  acceptance  acceptance criterion  access  account  
113              646   0.0         0.0                   0.0     0.0      0.0   
342             1775   0.0         0.0                   0.0     0.0      0.0   
3                202   0.0         0.0                   0.0     0.0      0.0   
129              728   0.0         0.0                   0.0     0.0      0.0   
32               257   0.0         0.0                   0.0     0.0      0.0   
..               ...   ...         ...                   ...     ...      ...   
140              793   0.0         0.0                   0.0     0.0      0.0   
165              919   0.0         0.0                   0.0     0.0      0.0   
180             1290   0.0         0.0                   0.0     0.0      0.0   
275             1644   0.0         0.0                   0.0     0.0      0.0   
400             2402   0.0         0.0                   0.0     0.0      0.0   

For reference, here is the column I am trying to predict, named "label_train":

113    14
342    17
3       1
129     0
32     12
       ..
140    15
165     1
180    15
275    12
400    14

I feed these two variables directly into a linear SVM:

from sklearn import svm

clf = svm.SVC(kernel="linear")
clf.fit(feature_train, label_train)  # <-- this takes forever

The indices are out of order because I use a train-test split function. When I fit sklearn.SVC(kernel="linear") on this DataFrame, it takes 4275 seconds to complete, but when I remove the 'Timestamp Start' column, it takes 6 seconds. Conversely, if I remove all of the tf-idf columns so that only 'Timestamp Start' remains, training also takes a very long time.

How can a single column of integers be substantially harder to train on than 2,000+ columns of floats? Is this normal behavior? If so, adding the remaining three timestamp columns would make training too slow for timestamps to be worth using at all.

desertnaut
Peter Robe

2 Answers


The answer was to scale the column's values to the range 0-1. Large unscaled values lead to a drastic decrease in performance.
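A minimal sketch of that fix, using scikit-learn's MinMaxScaler on a made-up DataFrame (the column names and data here are illustrative, not the original dataset):

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one large-valued timestamp column plus small tf-idf floats.
rng = np.random.default_rng(0)
feature_train = pd.DataFrame({
    "Timestamp Start": rng.integers(100, 3000, size=50),
    "able": rng.random(50) * 0.1,
    "access": rng.random(50) * 0.1,
})
label_train = pd.Series(rng.integers(0, 3, size=50))

# Rescale the timestamp column to [0, 1] so its magnitude matches
# the tf-idf columns and no longer dominates the optimization.
scaler = MinMaxScaler()
feature_train[["Timestamp Start"]] = scaler.fit_transform(
    feature_train[["Timestamp Start"]]
)

clf = svm.SVC(kernel="linear")
clf.fit(feature_train, label_train)
```

With the column scaled, the linear SVM converges quickly instead of stalling on the large raw timestamp values.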

Peter Robe

When you use a distance-based algorithm like an SVM, you want your features to be normalised so that no single feature dominates training. Have a look at this blog post by Roberto Reif; you will also find plenty of other resources on the why, what, and how of feature scaling.

https://www.robertoreif.com/blog/2017/12/16/importance-of-feature-scaling-in-data-modeling-part-1-h8nla

Bartek Malysz