
I am trying to classify a dataset of about 5,000 records (about 1,000 of them labeled true) into 2 classes using an SVM. My code is taken from the example, as below:

from sklearn import svm
# X is the feature matrix, Y the class labels; all other SVC parameters are left at their defaults
clf = svm.SVC()
clf.fit(X, Y)

so I am using mostly the default values. The variance is very high for me: the training accuracy is more than 95%, while the test accuracy, measured on about 50 records extracted from the data set, is only 50%.

However, if I change the sizes of the training and test data to about 3,000 and 2,000 records, the training accuracy drops to 80% and the test accuracy goes up. Why is this happening?
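
Roughly, the split and evaluation I am describing look like the sketch below (the helpers train_test_split and accuracy_score are scikit-learn's standard ones; the variable names and the stratify/random_state settings are just illustrative):

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# test_size=50 reproduces the small 50-record test set,
# test_size=2000 the larger 3000/2000 split mentioned above.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=2000, stratify=Y, random_state=0)

clf = svm.SVC()
clf.fit(X_train, Y_train)

print("train accuracy:", accuracy_score(Y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(Y_test, clf.predict(X_test)))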

Now, if I swap the SVM for scikit-learn's logistic regression, the percentages remain unchanged. Why is that so?
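
For that comparison, the logistic regression version is the same code with only the classifier swapped (again at its default settings; LogisticRegression comes from sklearn.linear_model, and X_train/Y_train are the same split as in the sketch above):

from sklearn.linear_model import LogisticRegression

# Identical split and evaluation; only the classifier changes.
clf = LogisticRegression()
clf.fit(X_train, Y_train)
print("test accuracy:", clf.score(X_test, Y_test))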

user 923227

1 Answer


Every modification to the SVM gives a different accuracy. Remember that accuracy on the training data is not the same as accuracy on unseen data. If you are aiming for high accuracy on both, I suggest you try cleaning the data first.
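
One concrete preprocessing step that often matters for the default (RBF-kernel) SVC is feature scaling. A minimal sketch, assuming the same X_train/Y_train/X_test/Y_test split as in the question, using scikit-learn's StandardScaler inside a pipeline (one common option, not the only way to clean the data):

from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance; fitting the
# scaler inside the pipeline keeps the test data out of the fit.
clf = make_pipeline(StandardScaler(), svm.SVC())
clf.fit(X_train, Y_train)
print("test accuracy:", clf.score(X_test, Y_test))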

Patrick
  • Hi @Patrick, thank you for your response. This data has been selected from the DB based on static tests. Should I remove the values > 5σ? Do I scale the numbers as well, (Xᵢ - µ)/(max - min)? – user 923227 Sep 13 '18 at 23:05
  • [Looks like this post has the scaling information](https://stackoverflow.com/questions/14688391/how-to-apply-standardization-to-svms-in-scikit-learn?rq=1) – user 923227 Sep 14 '18 at 04:57
  • Hi @Patrick, the data is expected to be cleaned already, with the 5σ values removed. I did scaling and changed the data split to 2,000 training points (1,440 True, 560 False) and 2,510 test points (2,350 True, 160 False). The data is expected to have 20% - 30% False. This reduced the variance: now I am getting 73.39% training accuracy and 74.77% test accuracy. Can this be considered an acceptable model? – user 923227 Sep 18 '18 at 21:30
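
For reference, the outlier removal and scaling discussed in the comments above could look roughly like this, assuming X is a NumPy feature matrix and Y the label vector; the 5σ cutoff and the (Xᵢ - µ)/(max - min) formula are the ones mentioned in the comments, and the sketch assumes no feature is constant:

import numpy as np

# Drop rows where any feature lies more than 5 standard deviations from its mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z <= 5).all(axis=1)
X_clean, Y_clean = X[mask], Y[mask]

# Per-feature scaling as described in the comments: (X_i - mean) / (max - min).
rng = X_clean.max(axis=0) - X_clean.min(axis=0)
X_scaled = (X_clean - X_clean.mean(axis=0)) / rng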