
I am using sklearn's SVC on a pandas DataFrame to predict categorical data. The feature matrix, named "feature_train", consists of a single time column (numpy.int64) and a few thousand tf-idf columns (which are very sparsely populated with numpy.float64 values):

     Timestamp Start  able  acceptance  acceptance criterion  access  account  
113              646   0.0         0.0                   0.0     0.0      0.0   
342             1775   0.0         0.0                   0.0     0.0      0.0   
3                202   0.0         0.0                   0.0     0.0      0.0   
129              728   0.0         0.0                   0.0     0.0      0.0   
32               257   0.0         0.0                   0.0     0.0      0.0   
..               ...   ...         ...                   ...     ...      ...   
140              793   0.0         0.0                   0.0     0.0      0.0   
165              919   0.0         0.0                   0.0     0.0      0.0   
180             1290   0.0         0.0                   0.0     0.0      0.0   
275             1644   0.0         0.0                   0.0     0.0      0.0   
400             2402   0.0         0.0                   0.0     0.0      0.0   

For reference, here is the column I am trying to predict, named "label_train":

113    14
342    17
3       1
129     0
32     12
       ..
140    15
165     1
180    15
275    12
400    14

I feed these two variables directly into a linear SVM:

from sklearn import svm

clf = svm.SVC(kernel="linear")
clf.fit(feature_train, label_train)  # <-- this takes forever

The indices are out of order because I use a train-test split function. When I fit sklearn.SVC(kernel="linear") on this DataFrame, it takes 4275 seconds to complete, but when I remove the 'Timestamp Start' column, it takes 6 seconds. Conversely, if I remove all of the tf-idf columns so that only 'Timestamp Start' remains, training also takes a very long time.

How can a single column of integers be substantially harder to train on than 2,000+ columns of floats? Is this normal behavior? If so, adding the remaining three timestamp columns would make training too slow for timestamps to be worth using at all.

desertnaut
Peter Robe

2 Answers


The answer was to scale the column's values to the range 0-1. Large unscaled values lead to a drastic decrease in performance.
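A minimal sketch of that fix, using scikit-learn's MinMaxScaler on a made-up DataFrame (the column names and data here are illustrative, not the original dataset):

```python
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one large-valued timestamp column plus small tf-idf floats.
rng = np.random.default_rng(0)
feature_train = pd.DataFrame({
    "Timestamp Start": rng.integers(100, 3000, size=50),
    "able": rng.random(50) * 0.1,
    "access": rng.random(50) * 0.1,
})
label_train = pd.Series(rng.integers(0, 3, size=50))

# Rescale the timestamp column to [0, 1] so its magnitude matches
# the tf-idf columns and no longer dominates the optimization.
scaler = MinMaxScaler()
feature_train[["Timestamp Start"]] = scaler.fit_transform(
    feature_train[["Timestamp Start"]]
)

clf = svm.SVC(kernel="linear")
clf.fit(feature_train, label_train)
```

With the column scaled, the linear SVM converges quickly instead of stalling on the large raw timestamp values.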

Peter Robe

When you use a distance-based algorithm like an SVM, you want your features to be normalised so that no single feature dominates training. Have a look at this blog post by Roberto Reif; you will also find plenty of other resources on the why, what, and how of feature scaling.

https://www.robertoreif.com/blog/2017/12/16/importance-of-feature-scaling-in-data-modeling-part-1-h8nla

Bartek Malysz