Multi-label classification for large dataset

Question

I am solving a multilabel classification problem. I have about 6 Million of rows to be processed which are huge chunks of text. They are tagged with multiple tags in a separate column.

Any advice on what scikit libraries can help me scale up my code. I am using One-vs-Rest and SVM within it. But they don't scale beyond 90-100k rows.

classifier = Pipeline([
('vectorizer', CountVectorizer(min_df=1)), 
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

Try training an `SGDClassifier` per label using its `partial_fit` API. Also consider using `HashingVectorizer` instead of counting + tf-idf. — Fred Foo, Nov 29 '13 at 09:45
Is it necessary to scale or normalize the output of `HashingVectorizer` or can it be directly fed into the `SGDClassifier`? — Matt, Dec 02 '13 at 12:16
Have you considered switching to a random forest classifier? It scales much better than SVM does. — nojka_kruva, Jan 06 '14 at 15:20

score 3 · Answer 1 · answered Mar 20 '14 at 16:59

SVM's scale well as the number of columns increase, but poorly with the number of rows, as they are essentially learning which rows constitute the support vectors. I have seen this as a common complaint with SVM's, but most people don't understand why, as they typically scale well for most reasonable datasets.

You will want 1 vs the rest, as you are using. One vs One will not scale well for this (n(n-1) classifiers, vs n).
I set a minimum df for the terms you consider to at least 5, maybe higher, which will drastically reduce your row size. You will find a lot of words occur once or twice, and they add no value to your classification as at that frequency, an algorithm cannot possibly generalize. Stemming may help there.
Also remove stop words (the, a, an, prepositions, etc, look on google). That will further cut down the number of columns.
Once you have reduced your column size as described, I would try to eliminate some rows. If there are documents that are very noisy, or very short after steps 1-3, or maybe very long, I would look to eliminate them. Look at the s.d. and mean doc length, and plot the length of the docs (in terms of word count) against the frequency at that length to decide
If the dataset is still too large, I would suggest a decision tree, or naive bayes, both are present in sklearn. DT's scale very well. I would set a depth threshold to limit the depth of the tree, as otherwise it will try to grow a humungous tree to memorize that dataset. NB on the other hand is very fast to train and handles large numbers of columns quite well. If the DT works well, you can try RF with a small number of trees, and leverage the ipython parallelization to multi-thread.
Alternatively, segment your data into smaller datasets, train a classifier on each, persist that to disk, and then build an ensemble classifier from those classifiers.

score 0 · Answer 2 · answered May 08 '14 at 23:45

HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory for instance.

You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.

You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go to monitor the accuracy of the partially trained model without waiting for having seen all the samples.

You can also do this in parallel on several machines on partitions of the data and then average the resulting coef_ and intercept_ attribute to get a final linear model for the all dataset.

I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736

There is also sample code in this tutorial on paralyzing scikit-learn with IPython.parallel taken from: https://github.com/ogrisel/parallel_ml_tutorial

Multi-label classification for large dataset

2 Answers2