13

libsvm provides tools for scaling data, but with scikit-learn (whose SVC classifier should be based on libsvm) I can't find a way to scale my data.

Basically I want to use 4 features, of which 3 range from 0 to 1 and the last one is a "big" highly variable number.

If I include the fourth feature in libsvm (using the easy.py script, which scales my data automatically), I get very good results (96% accuracy). If I include the fourth feature in scikit-learn, the accuracy drops to ~78%, but if I exclude it, I get the same results as in libsvm without that feature. So I am fairly sure the problem is missing scaling.

How do I replicate libsvm's scaling step programmatically (i.e. without calling svm-scale)?

Maehler
luke14free

2 Answers

9

You have that functionality in sklearn.preprocessing:

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The data will then have zero mean and unit variance.
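To check that claim on data shaped like the question's, here is a minimal sketch using `StandardScaler`, which learns the per-feature mean and standard deviation from the training set and reuses them on held-out data. The synthetic arrays and variable names below are assumptions made up for illustration, not the asker's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: three features in [0, 1] plus one large, highly variable feature.
X_train = np.hstack([rng.rand(100, 3), rng.normal(5000, 2000, size=(100, 1))])
X_test = np.hstack([rng.rand(20, 3), rng.normal(5000, 2000, size=(20, 1))])

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature mean and standard deviation of the scaled training data.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The scaled test set will generally not have exactly zero mean and unit variance, since it is transformed with the training statistics.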

Maehler
  • Good to know, thanks. Should I standardize the test data together with the training data and slice them afterwards, or should I standardize the test data by itself? – luke14free Nov 10 '12 at 17:35
  • 3
    That is mentioned in the [documentation](http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling). I guess you should do it separately, otherwise the training data would be influenced by the test samples. With the `Scaler` class you can calculate the mean and standard deviation of the training data and then apply the same transformation to the test data. – Maehler Nov 10 '12 at 17:50
  • 8
    You should use a `Scaler` for this, not the freestanding function `scale`. A `Scaler` can be plugged into a `Pipeline`, e.g. `scaling_svm = Pipeline([("scaler", Scaler()), ("svm", SVC(C=1000))])`. – Fred Foo Nov 11 '12 at 15:03
  • 1
    Does the `Scaler` in a `Pipeline` standardize the training and test data separately, or does it first standardize the whole data set before feeding it to the `svm`? – Francis Apr 18 '15 at 09:32
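On the last two comments: a `Pipeline` fits its scaler only on whatever is passed to `fit`, and then reuses those statistics in `predict` and `score`. The sketch below assumes a recent scikit-learn, where the `Scaler` class mentioned above is named `StandardScaler`; the synthetic data, the labels, and the train/test split are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Hypothetical data: three features in [0, 1] plus one large, highly variable feature.
X = np.hstack([rng.rand(200, 3), rng.normal(5000, 2000, size=(200, 1))])
y = (X[:, 3] > 5000).astype(int)  # made-up labels that depend on the large feature

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

scaling_svm = Pipeline([("scaler", StandardScaler()), ("svm", SVC(C=1000))])

# fit() computes the scaling statistics from X_train only, then trains the SVC on the scaled data.
scaling_svm.fit(X_train, y_train)

# score() scales X_test with the training statistics before predicting.
accuracy = scaling_svm.score(X_test, y_test)

# The mean stored in the fitted scaler is the training mean, not the mean of all of X.
fitted_mean = scaling_svm.named_steps["scaler"].mean_
print(accuracy, fitted_mean)
```

Because the scaler lives inside the pipeline, passing `scaling_svm` to cross-validation utilities should also refit the scaler on each training fold, so the test fold does not influence the scaling.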
0

You can also try StandardScaler for data scaling:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(Xtrain)  # Xtrain is the training data to be scaled
Xtrain = scaler.transform(Xtrain)
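If the goal is to mimic svm-scale more closely, note that (as far as I know) svm-scale rescales each feature linearly to a fixed range, [-1, 1] by default, rather than standardizing to zero mean and unit variance. A rough scikit-learn counterpart is `MinMaxScaler` with `feature_range=(-1, 1)`. This is a sketch under that assumption, with made-up data, and it is not guaranteed to reproduce svm-scale's output exactly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)

# Hypothetical data: three features in [0, 1] plus one large, highly variable feature.
X_train = np.hstack([rng.rand(100, 3), rng.normal(5000, 2000, size=(100, 1))])
X_test = np.hstack([rng.rand(20, 3), rng.normal(5000, 2000, size=(20, 1))])

# Map each training feature linearly onto [-1, 1].
range_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = range_scaler.fit_transform(X_train)

# Reuse the training minima and maxima for the test data;
# test values outside the training range will fall outside [-1, 1].
X_test_scaled = range_scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```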
Steffi Keran Rani J