
I am trying to classify documents based on their bag-of-words representation (1000 features). For the classification I am using an SVM, but it seems that sometimes the SVM doesn't terminate and runs endlessly. (Running scikit-learn: SVC(C=1.0, kernel='linear', cache_size=5000, verbose=True).) I am now looking for a solution and was thinking about applying a MinMax scaler to get a computationally efficient document representation. But do I screw up my bag-of-words representation with this feature normalization?

Thanks in advance!

jobooo

1 Answer


It does terminate, just quite slowly. Scaling your bag of words will not "screw up" anything; in fact it is an extremely common technique. You will rarely see a model that uses a raw bag of words: you either use a set of words (which is scaled by definition) or some scale-normalized bag of words, such as tf-idf (which is usually better than just "squashing" through min-max). In general, min-max is a very rough technique and extremely sensitive to outliers: if you have a document containing 1000 occurrences of the word "foo", your "foo" dimension will be squashed by 1000, even though that is just a single outlier. Consequently, prefer tf-idf, or at least a standard scaler.
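
For illustration, here is a minimal sketch of that suggestion using scikit-learn's TfidfVectorizer together with LinearSVC (the liblinear-based linear SVM, which is usually much faster than SVC(kernel='linear') on high-dimensional text); the corpus and labels below are placeholders:

    # tf-idf already normalizes the bag-of-words counts, and LinearSVC
    # is typically much faster than SVC(kernel='linear') on sparse text data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["first example document", "another example document"]  # placeholder corpus
    labels = [0, 1]                                                # placeholder labels

    clf = make_pipeline(
        TfidfVectorizer(max_features=1000),  # tf-idf weighted bag of words, 1000 features
        LinearSVC(C=1.0),                    # linear SVM via liblinear
    )
    clf.fit(docs, labels)
    print(clf.predict(["yet another document"]))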

lejlot