SVM newbie here. As training data, I have 160 categories, each defined by a list of membership terms and phrases. Some categories have only a few phrases, and others have hundreds.

I have a lot of text test data covering a wide variety of topics. I think I want a multiclass SVM built from one-vs-rest binary classifiers.

1) Should the training input for one category's SVM be a set of lines of the form 1 feature3:1 feature5:1 ... for the positive members, where each feature is a term/phrase from the class membership list, and lines of the form -1 feature1:1 feature2:1 feature4:1 ... for all members of other classes in the dictionary of known_terms_of_interest? Is a binary value sufficient?

2) Should the testing docs input only include terms found in the dictionary of known_terms_of_interest?

3) Is a linear kernel with C = 1 correct? Or, because some categories have few terms, should I use RBF?

The examples seem to begin with preprocessed files rather than raw text, so I'm missing the key setup steps; the documentation goes straight into descriptions of margins and such.

jonquille

1 Answer

1) Should the training input for one category's SVM be a set of lines of the form 1 feature3:1 feature5:1 ... for the positive members, where each feature is a term/phrase from the class membership list, and lines of the form -1 feature1:1 feature2:1 feature4:1 ... for all members of other classes in the dictionary of known_terms_of_interest? Is a binary value sufficient?

If your "featureX" is a natural number (the index of your word/phrase), then you have just described a valid set-of-words representation. It is the most basic approach to text classification, but it should work (in the sense that it is correct).
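
A minimal sketch of that representation in Python may make the file layout concrete. The helper names build_vocab and to_libsvm_line, the toy categories, and the choice of one line per category are assumptions made for illustration; none of them come from libsvm itself.

```python
def build_vocab(categories):
    """Assign a 1-based integer index to every distinct term/phrase."""
    vocab = {}
    for terms in categories.values():
        for term in terms:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, terms, vocab):
    """Format one example as '<label> <index>:1 ...' with ascending indices."""
    indices = sorted({vocab[t] for t in terms if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


# Hypothetical toy data: category name -> membership terms.
categories = {
    "fruit": ["apple", "banana"],
    "vehicle": ["car", "truck"],
}
vocab = build_vocab(categories)

# One-vs-rest training lines for the "fruit" classifier:
# label 1 for its own terms, -1 for the terms of every other category.
lines = [
    to_libsvm_line(1 if name == "fruit" else -1, terms, vocab)
    for name, terms in categories.items()
]
```

Sorting the indices keeps each line in ascending feature order, which is the order the sparse libsvm format expects.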

2) Should the testing docs input only include terms found in the dictionary of known_terms_of_interest?

They must include only features (as before, as indexes) of words/phrases seen during the training phase. libsvm will fail to run if you provide it with features it has never seen before.
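
One way to guarantee this is to encode each test document through the same term-to-index dictionary used for training and drop everything else. The sketch below assumes a hypothetical vocab mapping and plain whitespace tokenization, so it only matches single-word terms; multi-word phrases would need a phrase matcher instead.

```python
def encode_document(text, vocab):
    """Return sorted 'index:1' features for the known terms found in text."""
    tokens = text.lower().split()
    indices = sorted({vocab[tok] for tok in tokens if tok in vocab})
    return " ".join(f"{i}:1" for i in indices)


# Hypothetical vocabulary built at training time (term -> 1-based index).
vocab = {"apple": 1, "banana": 2, "car": 3}

# "bicycle" was never seen in training, so it is silently dropped.
# The leading 0 is a placeholder label for an unlabelled test document.
line = "0 " + encode_document("Car and bicycle near an apple", vocab)
```

The label column is only used to compute accuracy, so any number can stand in when the true class is unknown.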

3) Is a linear kernel with C = 1 correct? Or, because some categories have few terms, should I use RBF?

There is no general answer to this question: both the type of kernel and the value of C (as well as gamma in the case of RBF) have to be selected using some generalization testing technique (such as cross-validation).
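
The selection loop can be sketched in plain Python as a k-fold cross-validated grid search. Everything here is illustrative: k_fold_indices and grid_search are hypothetical helpers, the grid values are not recommendations, and dummy_score is a placeholder for a function that would really train an SVM on the training indices and return its accuracy on the held-out fold.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def grid_search(n, k, grid, score_fn):
    """Return (best_params, best_mean_score) over all parameter combinations.

    score_fn(params, train_idx, test_idx) is a user-supplied callable that
    trains on train_idx and returns accuracy on test_idx.
    """
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [score_fn(params, tr, te) for tr, te in k_fold_indices(n, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score


# Candidate settings; the values are illustrative, not recommendations.
grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}


def dummy_score(params, train_idx, test_idx):
    """Placeholder scorer standing in for real SVM training and evaluation."""
    return 0.9 if (params["kernel"], params["C"]) == ("linear", 1) else 0.5


best_params, best_score = grid_search(n=10, k=5, grid=grid, score_fn=dummy_score)
```

Because the folds are contiguous, the examples should be shuffled beforehand so that each fold contains a mix of classes.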

lejlot
  • Would you expect adding term frequency data to improve the classification? The 16,000 terms/phrases in 160 classes are unambiguous, so I hoped a binary found-or-not value would suffice. – jonquille Feb 02 '14 at 00:59
  • This would change set of words to bag of words, the next of the most basic approaches. – lejlot Feb 02 '14 at 07:40
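
The difference between the two representations is only in the feature value, as this small sketch shows. The encode helper, the vocab mapping, and the token list are hypothetical and exist only to illustrate the point.

```python
from collections import Counter


def encode(tokens, vocab, binary=True):
    """Return 'index:value' features: presence (set) or counts (bag)."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return " ".join(
        f"{i}:{1 if binary else counts[i]}" for i in sorted(counts)
    )


vocab = {"apple": 1, "car": 2}  # hypothetical term -> index mapping
tokens = ["apple", "car", "apple"]

set_of_words = encode(tokens, vocab, binary=True)   # presence only
bag_of_words = encode(tokens, vocab, binary=False)  # term frequencies
```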