I am working on a multilabel classification problem. I have about 6 million rows to process, each a large chunk of text, and each row is tagged with multiple tags in a separate column.
Can anyone advise which scikit-learn tools would help me scale up my code? I am using One-vs-Rest with an SVM inside it, but this does not scale beyond 90-100k rows. My current pipeline:
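    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    classifier = Pipeline([
        ('vectorizer', CountVectorizer(min_df=1)),  # raw term counts
        ('tfidf', TfidfTransformer()),              # reweight counts by TF-IDF
        ('clf', OneVsRestClassifier(LinearSVC()))])  # one linear SVM per tag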
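For context, here is a minimal sketch of how a pipeline like this would typically be fit on multilabel data. The toy `texts` and `tags` lists are hypothetical stand-ins for the real columns, and binarizing the tags with `MultiLabelBinarizer` is an assumption about the setup, not something stated above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real text and tag columns.
texts = [
    "python pandas dataframe merge",
    "java spring boot rest api",
    "python flask rest api",
    "java hibernate sql database",
]
tags = [["python", "pandas"], ["java", "api"], ["python", "api"], ["java", "sql"]]

# Assumption: tag lists are converted to a binary indicator matrix,
# one column per distinct tag, before fitting.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

# Fits one LinearSVC per tag column; everything is held in memory at once.
classifier.fit(texts, Y)

# Predictions come back as an indicator matrix; map them back to tag names.
predicted = classifier.predict(texts)
predicted_tags = mlb.inverse_transform(predicted)
```

With this shape of code, the whole corpus is vectorized and one SVM is trained per distinct tag in a single `fit` call, which is presumably where the scaling problem comes from.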