
I tried to build a very simple SVM predictor that I can understand with my basic Python knowledge. Since my code looks so different from this question and also this question, I don't know how to find the most important features for SVM prediction in my example.

I have the following 'sample' containing features and class (status):

A B C D E F  status
1 5 2 5 1 3  1
1 2 3 2 2 1  0
3 4 2 3 5 1  1
1 2 2 1 1 4  0

I saved the feature names as 'features':

A B C D E F

The features 'X':

1 5 2 5 1 3  
1 2 3 2 2 1 
3 4 2 3 5 1  
1 2 2 1 1 4  

And the status 'y':

1
0
1
0

Then I build X and y arrays out of the sample, train on half of the sample, test on the other half, and count the correct predictions.

import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import preprocessing

X = sample[features].values       # .values already returns a NumPy array
X = preprocessing.scale(X)        # standardize each feature column
y = sample['status'].values

test_size = int(X.shape[0]/2)

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X[:-test_size], y[:-test_size])

correct_count = 0
for x in range(1, test_size + 1):
    # predict the x-th sample from the end and compare with its true label
    if clf.predict(X[-x].reshape(1, -1))[0] == y[-x]:
        correct_count += 1
accuracy = (float(correct_count) / test_size) * 100.0

My problem is now, that I have no idea, how I could implement the code from the questions above so that I could also see, which ones are the most important features.

I would be grateful if you could tell me whether that's even possible for my simple version. And if yes, any tips on how to do it would be great.

Don
1 Answer


From the full feature set, the subset of variables which produces the lowest value for the square of the norm of the weight vector should be chosen as the variables of high importance.
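Since the question uses a linear kernel, one concrete way to inspect the learned weight vector is through the fitted classifier's `coef_` attribute: for a binary linear SVM it has shape `(1, n_features)`, and features with larger absolute weights contribute more to the decision. A minimal sketch, reusing the tiny sample data from the question (note `coef_` is only available for `kernel="linear"`):

```python
import numpy as np
from sklearn import svm

features = ['A', 'B', 'C', 'D', 'E', 'F']
X = np.array([[1, 5, 2, 5, 1, 3],
              [1, 2, 3, 2, 2, 1],
              [3, 4, 2, 3, 5, 1],
              [1, 2, 2, 1, 1, 4]])
y = np.array([1, 0, 1, 0])

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X, y)

# clf.coef_ has shape (1, n_features) for a binary problem;
# rank features by the absolute value of their weight
importance = np.abs(clf.coef_[0])
ranking = sorted(zip(features, importance), key=lambda t: -t[1])
for name, weight in ranking:
    print(name, weight)
```

With only four training rows the weights themselves are not very meaningful, but the same pattern applies to a realistically sized data set.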

Raunak Jhawar
  • 1,541
  • 1
  • 12
  • 21
  • Unfortunately I don't really understand what you mean, as my question is exactly how I can find these variables? – Don Aug 23 '16 at 09:23
  • To build your training set X, you can run iterations of the SVM classifier with different combinations of variables for every iteration and choose the X which produces the best classification. The combination of X, y which produces the best fit is your best selection of variables. Unfortunately, there is no direct way to determine the set of variables of high importance with just an SVM (or SVC) – Raunak Jhawar Aug 23 '16 at 09:29
  • ok, so I would need to make a loop which does the following: 1. random selection of features 2. run SVM 3. save the achieved accuracy with these features. Then I loop many, many times and try to find the features that are connected with the highest accuracies? Isn't the interaction between the features too high for me to determine it like this? – Don Aug 23 '16 at 09:45
  • Yes, this is right. SVMs alone are not used for feature selection. There are sophisticated procedures and techniques available for "feature selection" and for determining correlations between variables, which you may then use with PCA or similar for feature engineering. You should first select your set of important features and then run SVMs – Raunak Jhawar Aug 23 '16 at 10:04
  • ok, thanks for the reply! Could you perhaps also tell me how the methods work that I linked to in my question above? – Don Aug 23 '16 at 13:43
  • two things contribute most to variable selection: knowledge about the data set and machine learning techniques. To begin with, you can perhaps run a few CV iterations with different X's and evaluate the accuracy of the model. Otherwise, you may consider PCA, but with low dimensionality, PCA is overkill. – Raunak Jhawar Aug 24 '16 at 07:19
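The procedure discussed in the comments (try different feature combinations, score each, keep the best) can be sketched as an exhaustive search over feature subsets, scoring each subset with cross-validation. This uses `sklearn.model_selection.cross_val_score` with 2 folds on the tiny sample from the question; on real data you would use more folds, and with many features a randomized search instead of the full enumeration:

```python
import itertools
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

features = ['A', 'B', 'C', 'D', 'E', 'F']
X = np.array([[1, 5, 2, 5, 1, 3],
              [1, 2, 3, 2, 2, 1],
              [3, 4, 2, 3, 5, 1],
              [1, 2, 2, 1, 1, 4]])
y = np.array([1, 0, 1, 0])

best_score, best_subset = -1.0, None
# enumerate every non-empty subset of feature columns
for k in range(1, len(features) + 1):
    for subset in itertools.combinations(range(len(features)), k):
        clf = svm.SVC(kernel="linear", C=1)
        # mean accuracy of this feature subset under 2-fold cross-validation
        score = cross_val_score(clf, X[:, subset], y, cv=2).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print([features[i] for i in best_subset], best_score)
```

Note that this is a brute-force illustration: with n features there are 2^n - 1 subsets, so it only stays feasible for small n like the six features here.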