
I'm using libsvm in C-SVC mode with a polynomial kernel of degree 2 and I need to train multiple SVMs. During training, I get one or even both of these warnings for some of the SVMs:

```
WARNING: using -h 0 may be faster
*
WARNING: reaching max number of iterations
optimization finished, #iter = 10000000
```

I've found the description of the `-h` parameter:

```
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
```

and I've tried to read the explanation in the libsvm documentation, but it's a bit too high-level for me. Can anyone please provide a layman's explanation and, perhaps, some suggestions along the lines of "setting this would be beneficial because..."? Also, it would be helpful to know whether setting this parameter for all the SVMs that I train might negatively impact accuracy for those SVMs that do not give this warning.

I'm not sure what to make of the other warning.

Just to give more details: my training sets have 10 attributes (features) and they consist of 5000 vectors.
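
For concreteness, here is a minimal sketch of where the flag goes, assuming the libsvm Python bindings (e.g., the pip `libsvm` package) and a hypothetical training file; the same flags apply to the `svm-train` command-line tool:

```python
# Minimal sketch (not from the original question), assuming the pip "libsvm"
# package and a training file "train.dat" in libsvm format (hypothetical name).
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.dat')

# -s 0: C-SVC, -t 1: polynomial kernel, -d 2: degree 2,
# -h 0: disable the shrinking heuristics that the first warning refers to.
model = svm_train(y, x, '-s 0 -t 1 -d 2 -h 0')
```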


Update:

In case anybody else is getting the "reaching max number of iterations" warning: it seems to be caused by numeric stability issues, and it also makes training very slow. Polynomial kernels do benefit from using cross-validation to determine the best value for regularization (the C parameter); in my case it helped to keep it smaller than 8. Also, if the kernel is inhomogeneous, $(\gamma \sum_i x_i s_i + \mathrm{coef0})^d$ with coef0 != 0, then cross-validation can be implemented with a grid search over both gamma and C, since in this case the default value for gamma (1 / number_of_features) might not be the best choice. Still, from my experiments, you probably do not want gamma to be too big, since it will cause numeric issues (I am trying a maximum value of 8 for it).

For further inspiration on possible values for gamma and C, try poking around in `grid.py`.
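
A rough sketch of such a grid search (not from the original post), assuming the libsvm Python bindings, a hypothetical training file, and the inhomogeneous kernel selected via `-r 1`; the exponent ranges mirror grid.py's defaults:

```python
# Rough grid-search sketch (not from the original post), assuming the pip
# "libsvm" package and a training file "train.dat" (hypothetical name).
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.dat')

best = (-1.0, None, None)  # (CV accuracy, C, gamma)
for log2c in range(-5, 16, 2):      # grid.py's default C range: 2^-5 .. 2^15
    for log2g in range(-15, 4, 2):  # grid.py's default gamma range: 2^-15 .. 2^3
        c, g = 2.0 ** log2c, 2.0 ** log2g
        # -r 1: coef0 = 1 (inhomogeneous polynomial kernel),
        # -v 5: 5-fold cross-validation; returns accuracy instead of a model.
        acc = svm_train(y, x, '-s 0 -t 1 -d 2 -r 1 -c %g -g %g -v 5 -q' % (c, g))
        if acc > best[0]:
            best = (acc, c, g)

print('best CV accuracy %.2f%% with C=%g, gamma=%g' % best)
```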

Mihai Todor
  • Please explain how you arrived at gamma equal to 1 over the number of features and at the upper limit of eight for gamma. Thanks. – Cloud Cho Mar 20 '20 at 17:12
  • @CloudCho It has been quite a few years since then and I can't recall precisely, but I believe I started with the default value (1/num_features - see [here](https://github.com/cjlin1/libsvm/blob/557d85749aaf0ca83fd229af0f00e4f4cb7be85c/svm-train.c#L29)) and I tried to increase it gradually until I started getting that max iterations warning. If you want to get some good starting values for gamma and C, you'll need to trace how [these values](https://github.com/cjlin1/libsvm/blob/557d85749aaf0ca83fd229af0f00e4f4cb7be85c/tools/grid.py#L29-L30) get transformed until they're fed to svmtrain. – Mihai Todor Mar 21 '20 at 00:17
  • @CloudCho Also, it's super-important to scale your training data before trying to train a model because otherwise you'll run into numerical issues and your model will perform poorly. libsvm provides a tool called `svm-scale` for this purpose. See [here](https://github.com/cjlin1/libsvm/blob/master/svm-scale.c) – Mihai Todor Mar 21 '20 at 00:20
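
For reference, a minimal sketch of that scaling step (not from the original thread), using the libsvm Python bindings and a hypothetical file name; it applies a per-feature linear map to [-1, 1], similar in spirit to svm-scale's default behaviour:

```python
# Minimal scaling sketch (not from the original thread): map each feature to
# [-1, 1] using ranges learned from the training set, then reapply the same
# ranges to any test vector rather than recomputing them.
from libsvm.svmutil import svm_read_problem

y, x = svm_read_problem('train.dat')  # x is a list of {index: value} dicts

# Learn per-feature min/max from the training data only.
lo, hi = {}, {}
for row in x:
    for i, v in row.items():
        lo[i] = min(v, lo.get(i, v))
        hi[i] = max(v, hi.get(i, v))

def scale(row):
    # Reapply the training ranges; constant features are dropped.
    return {i: -1.0 + 2.0 * (v - lo[i]) / (hi[i] - lo[i])
            for i, v in row.items() if hi[i] > lo[i]}

x_scaled = [scale(row) for row in x]
```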

1 Answer


The shrinking heuristics are there to speed up the optimization. As it says in the FAQ, they sometimes help, and sometimes they do not. I believe it's a matter of runtime, rather than convergence.

The fact that the optimization reaches the maximum number of iterations is interesting, though. You might want to play with the stopping tolerance (`-e`) or the cost parameter (`-c`), or have a look at the individual problems that cause this. Are the datasets large?
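
For illustration, a minimal sketch of turning those two knobs with the libsvm Python bindings (editor's sketch, not part of the original answer; the file name and values are placeholders):

```python
# Editor's sketch, not part of the original answer: the two knobs mentioned
# above are plain libsvm options. File name and values are placeholders.
from libsvm.svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('train.dat')

# -e: stopping tolerance (default 0.001; loosening it can help convergence),
# -c: cost/regularization parameter (default 1; lower means a softer margin).
model = svm_train(y, x, '-s 0 -t 1 -d 2 -e 0.01 -c 0.5')
```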

Qnan
  • Thanks for the answer! I think you are right regarding the shrinking heuristics. They just help train the models faster. – Mihai Todor Sep 20 '12 at 12:55
  • Regarding the maximum iterations, my datasets have 5000 items each. The training takes less than one minute. What is the cost parameter? Is it the regularization? Right now I'm just setting it to 1, the default value in libsvm... – Mihai Todor Sep 20 '12 at 12:57
  • Oh, I should clarify this: my training sets have 10 attributes / features and they consist of 5000 vectors. – Mihai Todor Sep 20 '12 at 13:13
  • @MihaiTodor that should not present a problem for SVM, I think, unless you have many points with different labels and exactly the same feature vectors. The cost parameter is `-c` in LIBSVM, it defines how much you penalize classification errors. If it's too high, and the dataset isn't linearly separable in your kernel space, it might cause trouble. – Qnan Sep 20 '12 at 13:33
  • @MihaiTodor did you mean 5000 training instances and 10-dimensional feature space?... – Qnan Sep 20 '12 at 13:34
  • Yes, that is what I have and right now I'm seeing the iterations warning with `c` set to 1 (the default value). I plan to tweak both `c` and `gamma` to get better accuracy using cross-validation and grid search, but what should I do when I get this warning? – Mihai Todor Sep 20 '12 at 13:39
  • @MihaiTodor check the datasets that cause trouble. It really shouldn't take so long with only 10 dimensions. Any particular reason for using polynomial kernel, by the way? – Qnan Sep 20 '12 at 13:44
  • Well, the short story is that we want to build a secure recommender system that receives an encrypted vector from a client and exploits the homomorphic addition property of certain public-key encryption schemes in order to compute the prediction. Because the scheme is not fully homomorphic, we can only use certain kernels, like the polynomial one. How should I "check" the datasets and which method should I use? Eye inspection does not reveal much. Here is a sample: [part1](http://pastebin.com/vR8F6qEk) and [part2](http://pastebin.com/qpR6W3CN) – Mihai Todor Sep 20 '12 at 14:21
  • I see.. Make sure you scale the data prior to training. See `svm-scale` for details. I think that might be the problem. – Qnan Sep 20 '12 at 14:46
  • Yes, I also was thinking about this, but, unfortunately, due to the nature of our system, that will not be an option because we won't be able to match the same feature scaling on the client data. So, based on your experience and the data I provided, what should I expect from the model when I get this warning? – Mihai Todor Sep 20 '12 at 14:56
  • Why not? You don't have to adjust the scaling for the test data, just reapply a given one, determined from the training data (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f407). LIBSVM expects scaled data, at least roughly in the [-1;1] interval, and it seems to solve the problem with the test data you posted above. – Qnan Sep 20 '12 at 15:01
  • As to the warning: if the model did not converge, you shouldn't make any assumptions about its performance on the test data. I.e., it might just yield random results. – Qnan Sep 20 '12 at 15:03
  • I see... Well, the problem is that the test data needs to be fed to 30+ SVMs, so either I have to ask the client to scale the data for each particular SVM, encrypt it and send it (not acceptable), or I have to ask the server to scale the encrypted data (not feasible / interactive secure division is way too expensive). I'll have to discuss with my colleagues and see if we can find a way to get around this issue. Thank you very much. – Mihai Todor Sep 20 '12 at 15:07
  • @Qnan, high values for `C` don't cause problems for soft-margin SVM, whether the data is separable or not. Training time may be higher, but the optimization problem is always feasible for any positive, finite value of `C`. – Marc Claesen Jul 25 '13 at 21:56