key phrases to SVM

Question

SVM newbie here. For training data, I have 160 categories, each defined by a list of membership terms and phrases. Some categories have only a few phrases, and others have hundreds.

I have lots of text testing data with a wide topical variety. I think I want a multiclass SVM built from one-vs-rest binary classifiers.

1) Should the training input for one category's SVM be a set of lines like 1 feature3:1 feature5:1 ... for positive membership, where each feature is a term/phrase from the class membership list (is a binary value sufficient?), and lines like -1 feature1:1 feature2:1 feature4:1 ... for all members of other classes in the dictionary of known_terms_of_interest?

2) Should the testing docs input only include terms found in the dictionary of known_terms_of_interest?

3) Is a linear kernel with C = 1 correct, or should I use RBF because some categories have few terms?

It seems the examples begin with preprocessed files rather than raw text, so I'm missing the key setup steps; the documentation mostly goes into descriptions of margins and such.

score 0 · Answer 1 · answered Feb 01 '14 at 23:22

1) Should the training input for one category's SVM be a set of lines like 1 feature3:1 feature5:1 ... for positive membership, where each feature is a term/phrase from the class membership list (is a binary value sufficient?), and lines like -1 feature1:1 feature2:1 feature4:1 ... for all members of other classes in the dictionary of known_terms_of_interest?

If your "featureX" is a natural number (the index of your word/phrase), then you have just described a valid set-of-words representation. It is the most basic approach to text classification, but it should work (in the sense that it is correct).
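
To make the indexing concrete, here is a minimal sketch of how one-vs-rest training lines could be produced. The function names and the toy categories are made up for illustration, and it assumes each training example is simply a list of known terms tagged with its category.

```python
def build_term_index(categories):
    """Map every known term/phrase to a 1-based integer index (sorted for stability)."""
    terms = sorted({t for members in categories.values() for t in members})
    return {term: i for i, term in enumerate(terms, start=1)}


def to_sparse_line(label, terms_present, term_index):
    """Format one example as '<label> idx:1 idx:1 ...' with ascending indices."""
    indices = sorted({term_index[t] for t in terms_present if t in term_index})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label:+d} {features}".rstrip()


def one_vs_rest_lines(target, examples, term_index):
    """examples: list of (category, terms). Label is +1 for the target category, -1 otherwise."""
    return [
        to_sparse_line(1 if category == target else -1, terms, term_index)
        for category, terms in examples
    ]


# Hypothetical toy data, not taken from the question.
categories = {
    "fruit": ["apple", "banana split"],
    "tools": ["hammer", "socket wrench"],
}
term_index = build_term_index(categories)
examples = [("fruit", ["apple", "banana split"]), ("tools", ["hammer"])]

lines = one_vs_rest_lines("fruit", examples, term_index)
for line in lines:
    print(line)
```

Running the same loop once per category gives one training file per binary classifier.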

2) Should the testing docs input only include terms found in the dictionary of known_terms_of_interest?

They have to include only features (as before, as indices) of words/phrases seen during the training phase. libsvm will fail to run if you provide it with features it has never seen before.
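
One way to keep test input restricted to the training vocabulary is to scan the raw text for known terms and drop everything else. This is only a sketch: `term_index` stands in for whatever index was built at training time, the whole-word regex matching is an assumption about how phrases should be detected, and the leading 0 is a placeholder label for documents whose class is unknown.

```python
import re

# Hypothetical index produced during training (term -> 1-based feature index).
term_index = {"apple": 1, "banana split": 2, "hammer": 3}


def encode_test_doc(text, term_index, placeholder_label=0):
    """Return a sparse line containing only features for known terms found in text."""
    lowered = text.lower()
    found = sorted(
        idx
        for term, idx in term_index.items()
        # \b keeps "apple" from matching inside "pineapple".
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    )
    return " ".join([str(placeholder_label)] + [f"{i}:1" for i in found])


line = encode_test_doc(
    "A Banana Split and a pineapple, served with a screwdriver.", term_index
)
print(line)
```

Words outside the dictionary ("pineapple", "screwdriver") never reach the output, so the test file only ever contains indices the model was trained on.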

3) Is a linear kernel with C = 1 correct, or should I use RBF because some categories have few terms?

There is no single answer to this question. The type of kernel and the value of C (as well as gamma in the case of RBF) both have to be fitted using some generalization testing technique, such as cross-validation.
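
The selection loop itself can be sketched in plain Python. Everything here is illustrative: `train_and_score` is a hypothetical callback that you would replace with real SVM training on the training indices and accuracy measurement on the held-out indices, and `stand_in_score` exists only so the sketch runs end to end.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (shuffle first in real use)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def candidate_params(c_values, gamma_values):
    """Linear candidates vary C only; RBF candidates vary C and gamma."""
    grid = [{"kernel": "linear", "C": c} for c in c_values]
    grid += [
        {"kernel": "rbf", "C": c, "gamma": g}
        for c, g in product(c_values, gamma_values)
    ]
    return grid


def cross_validate(params, n_examples, k, train_and_score):
    """Average the held-out score over k folds for one parameter setting."""
    scores = []
    for held_out in k_fold_indices(n_examples, k):
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        scores.append(train_and_score(params, train, held_out))
    return sum(scores) / len(scores)


def select_best(grid, n_examples, k, train_and_score):
    """Pick the parameter setting with the highest mean cross-validation score."""
    return max(grid, key=lambda p: cross_validate(p, n_examples, k, train_and_score))


def stand_in_score(params, train_idx, test_idx):
    # Stand-in only, NOT a real model: pretends that C near 10 generalizes best.
    return 1.0 / (1.0 + abs(params["C"] - 10))


grid = candidate_params([0.1, 1, 10, 100], [0.01, 0.1])
best = select_best(grid, n_examples=10, k=5, train_and_score=stand_in_score)
print(best)
```

With 160 one-vs-rest classifiers, the same loop would be run per category, or once with a shared setting if that turns out to be good enough.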

Would you expect adding term frequency data to improve the classification? The 16,000 terms/phrases in 160 classes are unambiguous, so I hoped a binary found/not-found value would suffice. — jonquille, Feb 02 '14 at 00:59
This would change set of words to bag of words, the next most basic approach. — lejlot, Feb 02 '14 at 07:40
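
For comparison, the difference between the two representations is only in the feature value. A small sketch, assuming a hypothetical `term_index` and simple whole-word matching:

```python
import re
from collections import Counter

# Hypothetical index (term -> 1-based feature index).
term_index = {"apple": 1, "banana split": 2}


def count_terms(text, term_index):
    """Count whole-word occurrences of each known term in text, keyed by feature index."""
    lowered = text.lower()
    counts = Counter()
    for term, idx in term_index.items():
        n = len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", lowered))
        if n:
            counts[idx] = n
    return counts


def to_line(label, counts, binary):
    """binary=True gives set of words (value 1); binary=False gives bag of words (counts)."""
    feats = [f"{i}:{1 if binary else counts[i]}" for i in sorted(counts)]
    return " ".join([str(label)] + feats)


counts = count_terms("Apple pie, apple juice, and one banana split.", term_index)
set_line = to_line(1, counts, binary=True)
bag_line = to_line(1, counts, binary=False)
print(set_line)
print(bag_line)
```

Whether the counts actually help is something to check with the same cross-validation used for the kernel and C.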
