
I'm trying to apply the SVM from scikit-learn to classify the tweets I collected. There will be two categories, call them A and B. For now, I have all the tweets categorized into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what kind of input the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work. I found that the SVM uses numpy.ndarray as the type of its data input. Do I need to create one based on my own data? Should it be something like this?

Label     Feature     Frequency
  A       'book'          54
  B       'movies'        32
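
Would something along these lines be the right way to build that array, with one feature dictionary per tweet rather than one per category? (Just a sketch; the feature names and counts below are made-up placeholders.)

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm

# Placeholder data: one dict of unigram frequencies per tweet, plus its label.
feature_dicts = [{'book': 3, 'read': 1}, {'movies': 2, 'watch': 1}]
labels = ['A', 'B']

vec = DictVectorizer()
X = vec.fit_transform(feature_dicts)   # sparse matrix, one row per tweet

clf = svm.SVC()
clf.fit(X, labels)
print(clf.predict(vec.transform([{'book': 1}])))
```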

Any help is appreciated.

user1906856

1 Answer


Have a look at the documentation on text feature extraction.

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
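
Roughly, the feature-extraction step described in those docs looks like the sketch below (the file names come from the question; one tweet per line is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes one tweet per line in each file; adapt to however the tweets are actually stored.
with open('A.txt') as f:
    tweets_a = [line.strip() for line in f if line.strip()]
with open('B.txt') as f:
    tweets_b = [line.strip() for line in f if line.strip()]

texts = tweets_a + tweets_b
labels = ['A'] * len(tweets_a) + ['B'] * len(tweets_b)

vectorizer = TfidfVectorizer()        # turns raw text into a sparse term-weight matrix
X = vectorizer.fit_transform(texts)   # shape: (n_tweets, n_unique_terms)
```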

In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression or Bernoulli naive Bayes model might work just as well while being much faster to train.
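
For instance, a minimal sketch of comparing those linear models by cross-validation (it reuses the `texts` and `labels` lists from the extraction sketch above; the fold count is arbitrary):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

# `texts` is the list of raw tweets, `labels` the matching list of 'A'/'B' categories.
for clf in (Perceptron(), LogisticRegression(), BernoulliNB()):
    model = make_pipeline(TfidfVectorizer(), clf)   # vectorize inside the pipeline to avoid leakage
    scores = cross_val_score(model, texts, labels, cv=5)
    print(clf.__class__.__name__, scores.mean())
```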

ogrisel
  • Multinomial naive Bayes or SVM will both work for you. – Divyang Shah Nov 21 '14 at 08:40
  • the link to the `text classification example` is 404 – Alex Plugaru Apr 09 '15 at 09:48
  • Thanks for the report, I fixed the broken link. – ogrisel Apr 09 '15 at 15:37
  • @ogrisel: I am trying naive Bayes with 10 classes, but I am not satisfied with the result. SVM is a good fit if the dataset is small, with around 100 sentences per class. – user123 Apr 30 '15 at 11:29
  • For small numbers of samples (e.g. fewer than 10,000 or so), `SVC(kernel='linear')` might be fast enough to converge. However, it should give similar predictive performance to `LinearSVC` and comparable performance to `LogisticRegression`, both of which should be faster and can scale to hundreds of thousands of samples. In each case you need to pick the best value for C via cross-validation. Furthermore, `LogisticRegression` provides good probability estimates by default (via the `predict_proba` method). This is why I advise you to use linear models over the generic `SVC` by default. – ogrisel May 05 '15 at 14:12
  • SVC is only really interesting for non-linearly separable data with `kernel='rbf'`, and for problems with fewer than a couple of thousand samples (because of the more-than-quadratic time complexity of its solver). Text classification problems tend to be almost linearly separable, especially if you use bi-gram features. Therefore LinearSVC and LogisticRegression tend to be better choices for this class of problems. – ogrisel May 05 '15 at 14:14
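
Following the comments above, a minimal sketch of picking C by cross-validation with a linear model (the grid of C values and the fold count are arbitrary; `texts` and `labels` are the raw tweets and their categories as in the sketches above):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# Evaluate each candidate C by cross-validation and keep the best one.
grid = GridSearchCV(pipeline, {'clf__C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```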