I am working on a university project to detect letters in a photo. I can successfully extract words from the photo and cut them into single letters, which are black on a white background. These pictures look quite clear.

I have trained the SVC classifier from the Python scikit-learn library as follows:

classifier = svm.SVC(gamma=0.001)

It has been trained on about 800 letters which I obtained from words using my scripts. The classifier predicts letters very well when it works on letters on which it was trained. However, when I provide a new letter obtained with the same script from a different word, it fails every single time. The old and new examples seem to look very similar.

Can you give me any tips on how to improve this situation?

I have also trained this classifier on 26k letters from a ready-made dataset available online. The result was the same: perfect on training data, failure on new data.

Ghostwriter
  • Have you tried to visualize your dataset and classification results? – Roman Susi Nov 22 '15 at 16:31
  • I mean looking at some projections of your points with respect to the hyperplanes your classifier finds. Points that "look very similar" may be very far apart in a high-dimensional space. Maybe you have too many dimensions, which leads to overfitting ( http://stats.stackexchange.com/questions/35276/svm-overfitting-curse-of-dimensionality )? Also, you have not explained whether words play any role in your experiment or whether only the letters matter. – Roman Susi Nov 22 '15 at 16:57
  • Actually, the best advice is to give more information on your model, because otherwise it's all wild guessing. – Roman Susi Nov 23 '15 at 07:23

1 Answer

"The classifier predicts letters very well when it works on letters on which it was trained. However, when I provide a new letter obtained with the same script from a different word, it fails every single time."

This sounds like classic over-fitting, which means that the gamma parameter you chose (as well as the C parameter that you left at its default value) is not optimal for your data.
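To make the symptom concrete, here is a minimal sketch that compares accuracy on the training data with accuracy on held-out data. It assumes scikit-learn's bundled digits dataset as a stand-in for your letters, and it uses a deliberately large gamma to provoke the effect; neither choice comes from your setup, so your own numbers will differ.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digit images shipped with scikit-learn,
# not the letter images from the question.
X, y = load_digits(return_X_y=True)

# Hold out a quarter of the samples; the classifier never sees them in fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# gamma=1.0 is deliberately too large for these unscaled pixel values,
# so the RBF kernel treats almost every sample as unlike all the others.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is the signature of over-fitting: the model has memorised the training samples rather than learned something that generalises.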

In general, you should choose these parameters through a cross-validated grid search rather than picking them arbitrarily: their values can vastly change the performance of your model, especially for SVMs.
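A grid search along these lines could look like the following sketch. The digits dataset is again only a stand-in for your letters, and the candidate values in `param_grid` are illustrative assumptions, not recommendations; you would pick ranges suited to how your own features are scaled.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): replace with your own letter features and labels.
X, y = load_digits(return_X_y=True)

# Keep a test set aside so the final score is measured on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative candidate values (assumption); usually spaced logarithmically.
param_grid = {
    "C": [1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2],
}

# 5-fold cross-validation on the training set for every (C, gamma) pair;
# by default the best combination is then refit on the whole training set.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_acc = search.best_score_
test_acc = search.score(X_test, y_test)
print("best parameters:", best_params)
print(f"mean cross-validated accuracy: {cv_acc:.3f}")
print(f"held-out test accuracy:        {test_acc:.3f}")
```

The point of the separate test set is that the parameters are selected using only the training folds, so the final score is an honest estimate of how the classifier behaves on letters it has never seen.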

You would likely benefit from reading through the Model Selection and Evaluation section of scikit-learn's documentation, and following the advice there.

jakevdp