
Hello all machine learning experts, I am new to machine learning. My data has six features (6 regular attributes) and one label (1 special attribute) with two classes (True and False) (I hope I used the right terms). I want to combine those features into a single score using weights trained by an SVM. The data looks like this:

ZDis       ZAnch     ZSurf     Zval     ZDom     ZEntropy  Top5
0.48659   -0.20412  1.19243   0.15374  0.59667   1.34151   False
-0.10067  4.89898   -0.73677  0.22506  0.59667   1.34151   True
2.24837   -0.20412  -2.02291  0.22455  0.59667   1.34151   False
0.48659   -0.20412  1.19243   -0.06352 0.59667   1.34151   False
-0.68793  -0.20412  1.19243   0.12405  0.59667   1.34151   False
-2.02698  -0.40825  1.86371   0.07348  1.3272    -0.1242   False
-0.1807   2.44949   0.17865   0.07345  0.9401    0.1505    False
1.66557   2.44949   -1.50641  0.07381  0.9401    1.30135   False
1.11169   -0.40825  0.34716   0.07381  0.9401    -0.20225  True
1.5337    -0.40825  -0.01393  0.07381  -0.9954   0.53144   False
-0.01945  -0.48348  -1.16128  0.11035  2.02339   0.90237   False
-1.52944   3.23556  0.23428   0.11093  1.22613   -0.12973  False
0.43354   -0.48348  -2.20795  0.11093  1.22613   2.25734   False
2.84953   -0.48348  -2.20795  0.11093  1.49189   3.07609   True

So I want to compute total = X1*ZDis + X2*ZAnch + X3*ZSurf + X4*Zval + X5*ZDom + X6*ZEntropy, where X1..X6 are weights that should come from the SVM. I used RapidMiner to get these weights for my training set of 40 examples, and the result is below:

Total number of Support Vectors: 40
Bias (offset): -1.055
w[ZDis] = 0.076
w[ZAnch] = -0.058
w[ZSurf] = 0.057
w[Zval] = 0.010
w[ZDom] = 0.073
w[ZEntropy] = 0.077
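
For reference, here is a minimal Python sketch of how these weights combine into the total for the first example row (the sign(w·x + b) decision rule is my assumption about how the reported bias is applied):

import numpy as np

# Weights and bias reported by RapidMiner for the linear SVM
w = np.array([0.076, -0.058, 0.057, 0.010, 0.073, 0.077])  # ZDis..ZEntropy
b = -1.055                                                  # bias (offset)

# First example row from the table above
x = np.array([0.48659, -0.20412, 1.19243, 0.15374, 0.59667, 1.34151])

total = np.dot(w, x) + b   # X1*ZDis + X2*ZAnch + ... + X6*ZEntropy + bias
print(total)               # about -0.79; negative, so the SVM predicts False

This agrees with the first row's label (Top5 = False).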

I am not sure whether this is the correct approach, so I need your kind help. Thanks in advance. Also, if someone could guide me on how to write code for this SVM problem in Python, that would be helpful too.

Thanks Pallab


After getting feedback from you, I did some further analysis of my problem. I now have 277 examples, of which 8 are positive and 269 are negative, with 8 features, so it is clearly an imbalanced dataset. As I said before, I want to weight my features using the SVM weights and then compute (w1*x1 + w2*x2 + ... + w8*x8), which should help me extract the true results from my dataset. The data looks like this:

NameOfMotif eval_Zscore dis_Zscore abind_Zscore surf_Zscore pfam_Zscore ptm_Zscore coil_Zscore entropy_Zscore TrueVsFalse
ptk_9 0.77428 0.2387 -0.39736 1.48274 0.61237 -0.21822 0.49111 0.44599 False
ptk_8 0.77494 -0.97317 -0.39736 -0.27357 -1.63299 -0.21822 0.6181 -0.04028 False
ptk_3 0.77591 1.45058 -0.39736 -0.1139 0.61237 4.58258 0.74509 -0.85069 True
ptk_6 0.77583 -2.18505 -0.39736 -0.27357 0.61237 -0.21822 -0.3343 -0.92281 False
ptk_22 0.55932 1.45058 -0.39736 0.70216 0.61237 -0.21822 1.25303 -2.17556 False
ptk_23 0.51159 -0.97317 -0.39736 1.05697 -1.63299 -0.21822 1.25303 0.77021 False
ptk_20 0.62907 0.2387 -0.39736 1.05697 0.61237 -0.21822 -0.22848 -1.21702 False
..............................................................................
scf-trcp1_1 0.17425 2.23675 -0.92125 -0.03478 1.20877 5.13288 1.31262 2.27655 True
scf-trcp1_3 0.17425 -1.068 -0.92125 -0.82472 -2.43745 -0.43743 0.48341 -0.59339 False
scf-trcp1_5 0.17425 0.41914 0.24523 -1.05041 0.23644 -0.43743 -0.02919 1.68523 False
scf-trcp1_7 0.17425 -1.63453 -0.92125 -1.25354 -1.82975 -0.43743 -2.0193 0.95051

and my SVM output is:

kernel type: polynomial
cross-fold validation = 5
C = 100000.0
kernel degree = 1.0E-4
L-pos = 2.0
L-neg = 2.0
PerformanceVector:
accuracy: 84.60% +/- 23.58% (mikro: 84.48%)
precision: 31.08% +/- 25.51% (mikro: 12.77%) (positive class: True)
recall: 70.00% +/- 40.00% (mikro: 75.00%) (positive class: True)
AUC (optimistic): 0.793 +/- 0.184 (mikro: 0.793) (positive class: True)
AUC: 0.793 +/- 0.184 (mikro: 0.793) (positive class: True)
AUC (pessimistic): 0.793 +/- 0.184 (mikro: 0.793) (positive class: True)
ConfusionMatrix (rows = predicted class, columns = actual class):
             actual False   actual True
pred False        228             2
pred True          41             6
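
For comparison, here is a rough scikit-learn sketch of this kind of evaluation (the placeholder data, the polynomial degree of 2, and class_weight="balanced" are illustrative assumptions, not taken from the RapidMiner run):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data with the same shape and imbalance as described:
# 277 examples, 8 z-score features, 8 True vs 269 False.
rng = np.random.default_rng(0)
X = rng.normal(size=(277, 8))
y = np.array([True] * 8 + [False] * 269)

# Polynomial kernel and C = 1e5 as in the RapidMiner setup above.
# class_weight="balanced" reweights classes by inverse frequency, one
# standard way to handle the imbalance (playing a role similar to
# RapidMiner's L-pos / L-neg).
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=2, C=1e5, class_weight="balanced"),
)

# 5-fold stratified cross validation with the same metrics as above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall", "roc_auc"])
for name, values in scores.items():
    if name.startswith("test_"):
        print(name, round(values.mean(), 3))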

My question is: is my approach good enough now? Are the parameters I used to optimize the SVM fine? I am very new to this issue!! Thanks, Pallab

  • Why are you not sure if this is the correct approach? Did you test your parameters on test data? – Eric Conner Oct 01 '13 at 13:33
  • I checked on my test data: specifically, I took out 5 examples from my training set (so it now contains only 35 examples). In the test set, 4 are false and 1 is true, but RapidMiner gives me 5 false results!! – Paul85 Oct 01 '13 at 14:05

2 Answers


You are using a linear model: you assume that there exists a set of parameters that gives you the answer by simply calculating sign(w1*x1 + w2*x2 + ... + w6*x6 - b). Such an assumption rarely holds in low-dimensional spaces. In your particular example you have just 6 dimensions and a very small training set. With such small data there is almost no chance that any machine learning approach will produce good results, as they are all statistical methods. It is hard to talk about statistics with only a few dozen elements.

To the questions:

  • In order to try this out in Python, take a look at scikit-learn (a sketch combining the points below follows this list).
  • To test your model, perform cross validation: split your data into, for example, 5 chunks (each of 7 examples), train your SVM on 4 of those chunks (28 points) and test on the remaining chunk (7 points), then repeat 5 times so that each chunk is used exactly once for testing. Compute the average of the resulting accuracies.
  • To deal with low-dimensional, non-linearly-separable data, try some other kernels, like polynomial (with small degree) or, if that does not work, RBF.
  • Remember that SVM is a parametric model: you have to choose the correct parameters in order to get good results. A linear SVM requires a C parameter; the bigger the C, the more you "force" the SVM to classify the training data correctly (minimize the number of misclassifications). When using kernels you get additional parameters (besides C you get d for polynomial and gamma for RBF). The choice of best parameters can be made using grid search (scikit-learn has routines to automate this; read the documentation).
  • Data standardization: it is common knowledge that many ML models (including SVM) can perform badly on data where each feature has a different scale, which seems to be your case (Zval seems to be much smaller than ZEntropy). To avoid feature bias you should rescale the features, for example to the [-1, 1] interval, or normalize them so each has mean 0 and variance 1.
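
Putting these points together, here is a minimal scikit-learn sketch of such a setup (the parameter grids and the placeholder data are illustrative assumptions):

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder for the 40 training examples with 6 z-score features.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = np.array([True] * 10 + [False] * 30)

# Standardize each feature (mean 0, variance 1), then fit an SVM.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# Grid of kernels and their parameters; C controls how strongly
# misclassifications are penalized.
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10, 100]},
    {"svm__kernel": ["poly"], "svm__degree": [2, 3], "svm__C": [0.1, 1, 10, 100]},
    {"svm__kernel": ["rbf"], "svm__gamma": [0.01, 0.1, 1], "svm__C": [0.1, 1, 10, 100]},
]

# 5-fold cross validation inside the grid search, as described above.
search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)

If the best kernel turns out to be linear, the weights X1..X6 asked about in the question are available as search.best_estimator_.named_steps["svm"].coef_.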
lejlot

You mention that by holding out 5 records you obtained 5 False classifications, of which 4 were correct and 1 was incorrect. This is not enough information to know whether the model is any good. As the previous answer says, estimate the performance of the SVM on unseen data by doing cross validation (the RapidMiner operator is called X-Validation). This will give you a view of whether the model has any value at all. To tune the parameters of the SVM operator to improve the model, use the Loop Parameters operator and combine it with cross validation to get the estimated performance.

Andrew Chisholm