I am using sklearn.svm.SVC on a pandas DataFrame to predict categorical labels. The feature matrix, named "feature_train", consists of a single time column (numpy.int64) and a few thousand tf-idf columns (which sparsely contain numpy.float64 values):
Timestamp Start able acceptance acceptance criterion access account
113 646 0.0 0.0 0.0 0.0 0.0
342 1775 0.0 0.0 0.0 0.0 0.0
3 202 0.0 0.0 0.0 0.0 0.0
129 728 0.0 0.0 0.0 0.0 0.0
32 257 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ...
140 793 0.0 0.0 0.0 0.0 0.0
165 919 0.0 0.0 0.0 0.0 0.0
180 1290 0.0 0.0 0.0 0.0 0.0
275 1644 0.0 0.0 0.0 0.0 0.0
400 2402 0.0 0.0 0.0 0.0 0.0
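To make the shape of the data concrete, here is a toy construction that mimics it (synthetic values only; the `word_i` column names, the sparsity rate, and the sizes are placeholders, not my real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 400

# One integer time column with values in the hundreds/thousands.
feature_train = pd.DataFrame(
    {"Timestamp Start": rng.integers(100, 3000, size=n_rows)}
)

# A few thousand mostly-zero tf-idf float columns (~1% nonzero here).
tfidf = pd.DataFrame(
    np.where(rng.random((n_rows, 2000)) < 0.01,
             rng.random((n_rows, 2000)), 0.0),
    columns=[f"word_{i}" for i in range(2000)],
)
feature_train = pd.concat([feature_train, tfidf], axis=1)

# Integer class labels, roughly matching the 0-17 range shown below.
label_train = pd.Series(rng.integers(0, 18, size=n_rows))
```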
For reference, here is the label column I am trying to predict, named "label_train":
113 14
342 17
3 1
129 0
32 12
..
140 15
165 1
180 15
275 12
400 14
I feed these two variables directly into a linear SVM:

from sklearn import svm

clf = svm.SVC(kernel="linear")
clf.fit(feature_train, label_train)  # <-- this takes forever
(The indices are out of order because I use a train-test split function.) When I fit svm.SVC(kernel="linear") on this DataFrame, it takes 4275 seconds to complete, but when I drop the 'Timestamp Start' column, it takes 6 seconds. Conversely, if I drop all the tf-idf columns so that only 'Timestamp Start' remains, training again takes a very long time.
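For what it's worth, here is how I timed the two runs (a sketch with synthetic stand-in data; `time_fit` is just a helper I wrote, not part of sklearn):

```python
import time

import numpy as np
from sklearn import svm

def time_fit(X, y):
    """Fit a linear SVC and return the wall-clock fit time in seconds."""
    clf = svm.SVC(kernel="linear")
    t0 = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - t0

# Synthetic stand-in: one large-valued integer column (the timestamp)
# next to small float columns (the tf-idf features).
rng = np.random.default_rng(0)
X = np.hstack([rng.integers(100, 3000, size=(200, 1)),  # "Timestamp Start"
               rng.random((200, 50))])                   # tf-idf-like floats
y = rng.integers(0, 5, size=200)

t_all = time_fit(X, y)           # all columns, timestamp unscaled
t_no_ts = time_fit(X[:, 1:], y)  # timestamp column dropped
```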
Why is a single column of integers so much harder to train on than 2000+ columns of floats? Is this normal behavior? If so, adding the remaining 3 timestamp columns would make training too slow for timestamps to be worth using at all.