5

I'm working with scikit-learn, building predictive models with SVMs. I have a dataset with around 5000 examples and about 700 features. I'm doing 5-fold cross-validation with an 18x17 grid search on my training set, then using the optimal parameters on my test set. The runs are taking a lot longer than I expected, and I have noticed the following:

1) Some individual SVM training iterations seem to take only a minute, while others can take up to 15 minutes. Is this expected with different data and parameters (C and gamma; I'm using the RBF kernel)?

2) I'm trying to use 64-bit Python on Windows to take advantage of the extra memory, but all my Python processes seem to top out at 1 GB in Task Manager. I don't know if that has anything to do with the runtime.

3) I was using 32-bit before on about the same dataset, and I remember (though I didn't save the results) it being quite a bit faster. I used a third-party build of scikit-learn for 64-bit Windows, so I don't know if it's better to try this on 32-bit Python? (Source: http://www.lfd.uci.edu/~gohlke/pythonlibs/)

Any suggestions on how I can reduce the runtime would be greatly appreciated. I guess reducing the search space of the grid search would help, but since I'm unsure of even the range of optimal parameters, I'd like to keep it as large as I can. If there are faster SVM implementations, please let me know, and I may try those.

Addendum: I went back and tried running the 32-bit version again. For some reason it's much faster: it took about 3 hours to get to where the 64-bit version got in 16 hours. Why would there be such a difference?

yprez
tomas

3 Answers

7

1) This is expected: small gamma and small regularization will select more support vectors, hence the model will be more complex and take longer to fit.

2) There is a `cache_size` argument that is passed to the underlying libsvm library. However, depending on your data, libsvm may or may not use all of the available cache.

3) No idea. If you run more timed experiments on both platforms, please report your findings on the project mailing lists; this might deserve further investigation.

First, check that you have normalized your features (e.g. remove the mean and scale features by their variances if your data is a dense numpy array). For sparse data, just scale the features (or use a TF-IDF transform for text data, for instance). See the preprocessing section of the docs.
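
For a dense array, the scaling step might look like this (a minimal sketch using the current `sklearn.preprocessing` API; the data here is a random stand-in for the 5000 x 700 dataset, and the scaler must be fit on the training split only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for the real data: ~5000 examples, ~700 features, binary labels.
X_train = np.random.rand(5000, 700)
y_train = np.random.randint(0, 2, 5000)
X_test = np.random.rand(1000, 700)

scaler = StandardScaler()          # removes the mean, scales to unit variance
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the training statistics on the test set
```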

Then you should start with a coarse grid (with large logarithmic steps), say a 3x3 grid, and then focus on the interesting areas by rerunning a 3x3 grid over them. In general the C x gamma SVM parameter grid is quite smooth.
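
A rough sketch of the coarse-then-fine idea (assuming the `X_train`/`y_train` stand-ins from the scaling snippet above, and the modern `sklearn.model_selection` import path; the grid bounds are arbitrary placeholders):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Coarse 3x3 grid with large logarithmic steps.
coarse = {"C": np.logspace(-2, 2, 3), "gamma": np.logspace(-4, 0, 3)}
search = GridSearchCV(SVC(kernel="rbf", cache_size=1000), coarse, cv=5)
search.fit(X_train, y_train)

# Finer 3x3 grid centered (in log space) on the coarse optimum.
C0, g0 = search.best_params_["C"], search.best_params_["gamma"]
fine = {"C": C0 * np.logspace(-1, 1, 3), "gamma": g0 * np.logspace(-1, 1, 3)}
search = GridSearchCV(SVC(kernel="rbf", cache_size=1000), fine, cv=5)
search.fit(X_train, y_train)
```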

ogrisel
  • Thanks for the answers ogrisel, they make a lot of sense. I'm not sure about the 32-bit vs. 64-bit issue either, but if I get a chance I will try to do a few more timed runs. My data is preprocessed (normalized to 0-1) and I increased the cache_size to 4000 for scikits (probably overkill). I'll look into modifying my code so that it goes from a coarse grid to a smaller area; that should definitely help speed things up. Thanks again. – tomas Feb 07 '12 at 15:05
  • @OGrisel, how about a general coarse-then-fine grid searcher? – denis Feb 08 '12 at 18:08
  • I guess an additional question: if you're doing cross-validation + grid search, how can you use a coarse-then-fine grid search? When you average over several cross-validation runs, the search spaces won't match up if you use coarse then fine. I'm sure there's a good way that I'm unaware of/missing. – tomas Feb 08 '12 at 22:07
  • Yes, I understand, but when cross-validating you'll be running the grid search five times. The finer 5x5 search space could be different for each cross-validation fold. How does one average the fit over the cross-validation runs to find the optimal test-set parameters? – tomas Feb 09 '12 at 16:10
  • "could be different": don't do that then. The initial say 5x5 coarse grid -> ncross different C, gamma -> *one* 3x3 subgrid -> *one* say 5x5 finer grid ... – denis Feb 09 '12 at 16:50
4

If you can afford this, consider using LinearSVC: libsvm-based SVCs have training complexity between O(n_features * n_samples^2) and O(n_features * n_samples^3), while LinearSVC (based on liblinear) has O(n_features * n_samples) training complexity and O(n_features) test complexity.
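
The swap is minimal since the estimator interface is the same (a sketch, assuming the `X_train`/`y_train`/`X_test` arrays from the question's setup; note there is no `gamma`, so only `C` needs tuning):

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0)       # only C to tune; linear kernel only, no gamma
clf.fit(X_train, y_train)    # roughly O(n_features * n_samples) training
predictions = clf.predict(X_test)
```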

Mikhail Korobov
  • In practice `SGDClassifier` is even faster for fitting linear SVM models in scikit-learn. And we have not implemented averaging yet :) – ogrisel Feb 10 '12 at 08:08
  • Can we ask LinearSVC to output probabilities? It doesn't have a parameter like SVC's to control whether the outputs are probabilities. What do people typically do for this? – GabrielChu Jan 15 '19 at 22:58
3

SGD is very fast, but 1) it's linear only, not RBF; 2) it has parameters alpha, eta0 ... which I have no idea how to vary: over to the expert, O. Grisel.
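
For reference, a minimal sketch of a linear SVM fit with SGD (hinge loss gives a linear SVM; `alpha` is the regularization term, and `alpha ~ 1/(C * n_samples)` is the usual rule of thumb for matching an SVC's `C` -- treat the values here as placeholders):

```python
from sklearn.linear_model import SGDClassifier

# hinge loss => linear SVM; alpha plays the role of regularization strength
clf = SGDClassifier(loss="hinge", alpha=1e-4)
clf.fit(X_train, y_train)
```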

On 32-bit vs. 64-bit Python (what hardware, what Python version?), I have no idea, but that might be worth a general question on SO -- there must be benchmark suites. Can you see CPU usage > 90%, count garbage collections?

denis
  • It's possible to approximate a non-linear RBF kernel in a scalable way using [explicit feature maps](http://scikit-learn.org/dev/auto_examples/plot_kernel_approximation.html#example-plot-kernel-approximation-py) and a linear classifier such as SGDClassifier (a sketch of this appears after these comments). – ogrisel Feb 10 '12 at 08:13
  • I never try to tune `eta0` (maybe I should). For `alpha` I just use `GridSearchCV` as I would for the `C` in `LinearSVC`. – ogrisel Feb 10 '12 at 21:40
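
Following up on the explicit feature map comment above, a hedged sketch of that approach: approximate the RBF kernel with random Fourier features (`RBFSampler`), then fit a fast linear classifier on the transformed data (parameter values are placeholders to tune):

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Map inputs into an approximate RBF feature space, then fit a linear SVM.
clf = make_pipeline(
    RBFSampler(gamma=0.1, n_components=500, random_state=0),
    SGDClassifier(loss="hinge", alpha=1e-4),
)
clf.fit(X_train, y_train)
```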