I am working on a texture classifier using scikit-learn's `sklearn.svm.SVC` as the model. I am a bit confused by some of the results I get, mainly why the choices of the parameters `gamma` and `C` have such a big influence. Here is a short description of my case:
In total I have 195 images divided into 8 classes as follows: wood - 56, marble - 23, cement - 16, concrete - 7, tile - 32, carpet - 18, brick - 25, fabric - 18.
I randomly divide my data into training and test sets with an 80/20 split.
I use SVC as the model: `model = SVC(C=C, gamma=gamma, random_state=42)`
I fit the model on the training data, then predict classes for the test data and use accuracy (the ratio of correct predictions) as the score.
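A minimal sketch of this setup (the feature matrix here is a synthetic stand-in; my real `X` comes from the texture images, which I haven't shown):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Stand-in data: 195 samples, 8 classes, mimicking the shape of my problem.
X, y = make_classification(n_samples=195, n_features=20, n_informative=10,
                           n_classes=8, random_state=42)

# The 80/20 train/test split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = SVC(C=100, gamma=10, random_state=42)
model.fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))
```
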
For different choices of C and gamma my score looks as follows:
For C=100 and varying gamma:
| gamma | 1e-3 | 1e-2 | 1e-1 | 1e0 | 1e1 | 1e2 | 1e3 | 1e4 |
|-------|------|------|------|-----|-----|-----|-----|-----|
| score | 0.28 | 0.28 | 0.29 | 0.36 | 0.54 | 0.76 | 1.00 | 1.00 |
And for fixed gamma=10 and varying C:
| C     | 1e-1 | 1e0 | 1e1 | 1e2 | 1e3 | 1e4 | 1e5 | 1e6 |
|-------|------|-----|-----|-----|-----|-----|-----|-----|
| score | 0.28 | 0.29 | 0.37 | 0.54 | 0.67 | 0.83 | 0.97 | 1.00 |
As seen, the score increases as both C and gamma increase.
First of all, I think the result for large C and gamma is too 'perfect': 100% accuracy. As I understand the influence of C, the larger C is, the more exactly the model fits the training data, but the less 'smooth' the model becomes. I guess you could interpret the result as the train and test data being very similar, so a large value of C results in high accuracy. If we consider gamma as the inverse of the influence radius of each training sample, that likewise suggests the train and test data might be very similar.
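To make the "influence radius" intuition concrete, here is a small sketch (not from my classifier) of the RBF kernel that SVC uses by default, K(x, y) = exp(-gamma * ||x - y||^2):

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2): similarity between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.sum(diff ** 2))

# Small gamma: points a distance of 1 apart are still nearly identical
# to the kernel, so each training sample influences a wide neighborhood.
wide = rbf_kernel([0.0], [1.0], gamma=1e-3)   # close to 1

# Large gamma: the same pair is essentially dissimilar, so each training
# sample only influences a tiny neighborhood around itself.
narrow = rbf_kernel([0.0], [1.0], gamma=1e3)  # close to 0
```
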
On the other hand, when I use GridSearchCV to tune the parameters, I get this result:
The best parameters are {'C': 100000.0, 'gamma': 10.0} with a score of 0.46
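The grid search I ran looks roughly like the following sketch (the exact parameter grid and `cv` value are my assumptions, and the data is a synthetic stand-in). Note that `best_score_` in scikit-learn is the mean accuracy over the held-out cross-validation folds, not the accuracy on a separate test set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the 195-sample, 8-class texture feature matrix.
X, y = make_classification(n_samples=195, n_features=20, n_informative=10,
                           n_classes=8, random_state=42)

# A reduced grid for illustration; my real search covered wider ranges.
param_grid = {"C": [1e2, 1e5], "gamma": [1e-1, 1e1]}
search = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
search.fit(X, y)

# Mean cross-validated accuracy of the best parameter combination.
print(search.best_params_, search.best_score_)
```
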
How can I interpret this score of 0.46? It seems very low compared to the accuracy I got on my own test set.