42

I have found that scaling the features in SVM (Support Vector Machine) problems really improves performance. I have read this explanation:

The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.

Unfortunately this didn't help me. Can somebody provide a better explanation?

desertnaut
  • 57,590
  • 26
  • 140
  • 166
Kevin
  • 569
  • 1
  • 5
  • 12
  • Are you talking about log-normalizing data? – Leo Oct 06 '14 at 21:54
  • 4
    Maybe you should ask this question at http://stats.stackexchange.com/ - this forum is for programming questions; your question sounds like a theoretical one – Leo Oct 06 '14 at 22:01

7 Answers

62

Feature scaling is a general trick applied to optimization problems (not just SVM). The underlying algorithm used to solve the optimization problem of SVM is gradient descent. Andrew Ng has a great explanation in his Coursera videos here.

I will illustrate the core ideas here (borrowing Andrew's slides). Suppose you have only two parameters and one of them can take a relatively large range of values. Then the contours of the cost function can look like very tall, skinny ovals (see the blue ovals below). Your gradient descent path (drawn in red) could take a long time, going back and forth, to find the optimal solution.
[figure: tall, skinny cost-function contours with a zig-zagging gradient descent path in red]

Instead, if you scale your features, the contours of the cost function might look like circles; then gradient descent can take a much straighter path and reach the optimal point much faster.

[figure: near-circular cost-function contours with a short, direct gradient descent path]
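
To make this concrete, here is a small toy sketch of my own (it is not from the answer or from Andrew's slides; the data, learning rates and tolerance are assumptions chosen purely for illustration): plain batch gradient descent on a least-squares cost, once with a feature whose range is about 1000 times larger than the other's, and once with both features standardized.

```python
import numpy as np

def steps_to_converge(X, y, lr, tol=1e-4, max_iter=200_000):
    """Batch gradient descent on 0.5 * mean((X @ w - y)**2); return the number
    of iterations needed to get within `tol` of this design matrix's own optimum."""
    n = len(y)
    w_opt, *_ = np.linalg.lstsq(X, y, rcond=None)
    cost_opt = 0.5 * np.mean((X @ w_opt - y) ** 2)
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        if 0.5 * np.mean((X @ w - y) ** 2) - cost_opt <= tol:
            return i
        w -= lr * (X.T @ (X @ w - y)) / n
    return max_iter  # did not reach the tolerance within the budget

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)      # small-range feature
x2 = rng.uniform(0, 1000, 200)   # large-range feature
y = 3 * x1 + 0.002 * x2 + rng.normal(0, 0.01, 200)
X_raw = np.column_stack([x1, x2])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The usable learning rate is capped by the large-range feature, so descent
# crawls along the narrow valley of the "tall, skinny" contours ...
print("raw:   ", steps_to_converge(X_raw, y, lr=1e-6))
# ... while on standardized features the contours are nearly circular.
print("scaled:", steps_to_converge(X_std, y, lr=0.1))
```

On the raw data the stable learning rate is capped by the large-range feature, so the run typically exhausts its iteration budget; on the standardized data it should converge within a few dozen steps.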

greeness
  • 15,956
  • 5
  • 50
  • 80
  • Thank you so much greeness. Your answer is really clear, but it explains why scaling improves computation speed, not accuracy as I asked, in my humble opinion. Thank you! – Kevin Oct 07 '14 at 09:42
  • @Venik I think the reason for the above is in his answer. I am not exactly sure though: <> – Autonomous Oct 08 '14 at 05:11
  • This answer is not correct, SVM is not solved with SGD in most implementations, and the reason for feature scaling is completely different. – lejlot Nov 14 '14 at 00:55
  • 3
    I don't agree. To avoid the big values' dominating effect is probably the primary advantage. However, the author of libsvm also pointed out that feature scaling has the advantage of preventing numeric problems. see Section 2.2 http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf – greeness Nov 14 '14 at 08:31
  • I also don't know why you think gradient descent is not used to solve SVM in most implementations. In libsvm's different versions, I see implementations of coordinate gradient descent and also sub-gradient descent. – greeness Nov 14 '14 at 08:36
  • Reasoning behind a "faster convergence" and a "direct path" to the local optimum (rather, a local critical point) assumes that the step sizes do not change! This assumption may not hold. For example, if step sizes become smaller with feature scaling, then it will take more steps (albeit made on a straight line) to get to the critical point. – Oleg Melnikov Jan 27 '17 at 23:12
  • It means **it speeds up gradient descent by making it require fewer iterations to get to a good solution**. – Ishaan Jun 25 '19 at 06:56
49

The true reason behind scaling features in SVM is the fact that this classifier is not invariant to affine transformations. In other words, if you multiply one feature by 1000, then the solution given by the SVM will be completely different. It has nearly nothing to do with the underlying optimization techniques (although they are affected by these scale problems, they should still converge to the global optimum).

Consider an example: you have men and women, encoded by their sex and height (two features). Let us assume a very simple case with the following data:

0 -> man
1 -> woman

╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║  1  ║  150   ║
╠═════╬════════╣
║  1  ║  160   ║
╠═════╬════════╣
║  1  ║  170   ║
╠═════╬════════╣
║  0  ║  180   ║
╠═════╬════════╣
║  0  ║  190   ║
╠═════╬════════╣
║  0  ║  200   ║
╚═════╩════════╝

And let us do something silly: train it to predict the sex of the person, so we are trying to learn f(x, y) = x (ignoring the second feature).

It is easy to see that for such data the largest-margin classifier will "cut" the plane horizontally somewhere around height 175, so once we get a new sample "1 178" (a woman of 178 cm height) we get the classification that she is a man.

However, if we scale everything down to [0, 1] we get something like

╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║  1  ║  0.0   ║
╠═════╬════════╣
║  1  ║  0.2   ║
╠═════╬════════╣
║  1  ║  0.4   ║
╠═════╬════════╣
║  0  ║  0.6   ║
╠═════╬════════╣
║  0  ║  0.8   ║
╠═════╬════════╣
║  0  ║  1.0   ║
╚═════╩════════╝

and now the largest-margin classifier "cuts" the plane nearly vertically (as expected), so given the new sample "1 178", which is scaled to around "1 0.56", we get that it is a woman (correct!).

So in general, scaling ensures that features will not become the main predictors just because their numeric values happen to be large.
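
For what it's worth, here is a minimal runnable sketch of this toy example using scikit-learn (my own reconstruction; the assumption is that a linear SVC with default settings behaves like the large-margin classifier described above). The data are exactly the two tables, and MinMaxScaler reproduces the [0, 1] scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Features: (sex, height); the label we (silly as it is) try to predict is sex itself.
X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = X[:, 0]                                     # label = the sex column
new_sample = np.array([[1, 178]], dtype=float)  # a woman of 178 cm

# Unscaled: height dominates the margin, so the tall woman is likely
# classified as a man (0).
raw_svc = SVC(kernel="linear").fit(X, y)
print("unscaled:", raw_svc.predict(new_sample))

# Scaled to [0, 1]: the sex feature dominates, and the prediction is a woman (1).
scaler = MinMaxScaler().fit(X)
scaled_svc = SVC(kernel="linear").fit(scaler.transform(X), y)
print("scaled:  ", scaled_svc.predict(scaler.transform(new_sample)))
```

The exact margins depend on the SVC settings, but the geometry is the point: after scaling, both features live on comparable ranges, so the margin is no longer dictated by height alone.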

blkpingu
  • 1,556
  • 1
  • 18
  • 41
lejlot
  • 64,777
  • 8
  • 131
  • 164
  • 4
    Another intuitive example: Suppose we want to classify a group of people based on attributes such as height (measured in metres) and weight (measured in kilograms). The height attribute has a low variability, ranging from 1.5 m to 1.85 m, whereas the weight attribute may vary from 50 kg to 250 kg. If the scale of the attributes is not taken into consideration, the distance measure may be dominated by differences in the weights of the people. Source: Introduction to Data Mining, Chapter 5, Pang-Ning Tan – ruhong Apr 22 '18 at 02:01
  • I still don't understand why the network won't automatically scale the features. Won't the training just set the weights to scale the data for you? Like the height and weight example in these comments.. I would think the training would scale the low variability attributes with a large weight and the high variability features with a lower weight. Why wouldn't that happen? – Kevlar Sep 07 '18 at 22:28
  • To agree with the post after the first table, it looks to me as though the key should be 0-woman, 1-man, and the first table should be 0 150, 0 160, 0 170, 1 180, 1 190, 1 200. – Joffer Jan 14 '19 at 17:21
2

Just personal thoughts, from another perspective.
1. Why does feature scaling have an influence?
There is a saying in machine learning: "garbage in, garbage out". The more faithfully your features reflect the data, the more accurate your algorithm will be. The same applies to how machine learning algorithms treat relationships between features. Unlike the human brain, when a machine learning algorithm does classification, for example, all the features are expressed and computed in the same coordinate system, which in some sense establishes an a priori assumption about the relative importance of the features (not a real reflection of the data itself). Also, the nature of most algorithms is to find the most appropriate weights for the features to fit the data. So when the input consists of unscaled features, the features with large numeric ranges have more influence on the weights, which again is not a reflection of the data itself.
2. Why does feature scaling usually improve accuracy?
Common practice in unsupervised machine learning when selecting hyper-parameters (or hyper-hyper-parameters, for example in the hierarchical Dirichlet process or hLDA) is not to add any personal, subjective assumptions about the data; the best option is simply to assume they are all equally probable. I think the same applies here. Feature scaling just makes the assumption that all features have an equal opportunity to influence the weights, which more faithfully reflects the information/knowledge you have about the data. This commonly also results in better accuracy.

BTW, regarding affine-transformation invariance and faster convergence, there is an interesting link here on stats.stackexchange.com.

Community
  • 1
  • 1
weiheng
  • 346
  • 4
  • 16
2

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, and so it oscillates inefficiently down to the optimum when the variables are very uneven. This is from Andrew Ng's Coursera course.

So it amounts to something like standardizing the data. Sometimes researchers want to know whether a specific observation is common or exceptional, so they express a score in terms of the number of standard deviations it is removed from the mean. This number is what we call a z-score. If we recode the original scores into z-scores, we say that we standardize a variable.
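
As a minimal sketch of that last point (the numbers below are made up for illustration), computing a z-score is just subtracting the mean and dividing by the standard deviation:

```python
import numpy as np

# Hypothetical raw scores; any 1-D array of measurements works the same way.
scores = np.array([150.0, 160.0, 170.0, 180.0, 190.0, 200.0])

# z-score: how many standard deviations each score is from the mean.
z = (scores - scores.mean()) / scores.std()
print(z)   # values near 0 are "common"; large |z| values are "exceptional"
```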

iali87
  • 147
  • 10
1

What I have learnt from the Andrew Ng course on Coursera is that feature scaling helps gradient descent converge more quickly. If the data is more spread out, that is, if it has a higher standard deviation, gradient descent will take relatively more time to converge compared to the situation where we scale our data via feature scaling.

Dude
  • 21
  • 1
  • 7
1

The idea of scaling is to avoid excess computation on a particular variable by standardising all the variables onto the same scale; with this, the slope (y = mx + c) becomes a lot easier to calculate, since we are normalizing the m parameter so that it converges as quickly as possible.

Sree11
  • 11
  • 1
  • 3
1

Yes, if there is no normalisation then the contours will be skinny. Thus, with normalisation:

  1. Values stay within a comparable range
  2. The calculation of theta speeds up, because fewer calculations are required
Rob
  • 26,989
  • 16
  • 82
  • 98