I have a bunch of text documents that I feed to a TfidfVectorizer, whose output I then use for multi-label text classification. I will keep getting a stream of more documents in the future. How do I add new words the vectorizer has never seen before without retraining the model from scratch? Is partial_fit the only option, given that OneVsRestClassifier and Pipeline don't work with it? Here is the page I am talking about: online learning of text documents.
- Here's a related question: https://stackoverflow.com/questions/39109743/adding-new-text-to-sklearn-tfidif-vectorizer-python/39114555#39114555 – Metropolis Mar 06 '18 at 15:39
- The thing to consider here is that when new data comes in, it may (will) happen that the TfidfVectorizer gets new features (vocabulary words), which will change the shape of your feature vector (the number of columns will increase). Then you have to retrain the classification model. No model in scikit-learn will work with a changed number of features, even with `partial_fit()`. – Vivek Kumar Mar 07 '18 at 05:01
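As a minimal sketch of the shape problem described in that comment (toy two-word documents, not from the original post; assumes scikit-learn is installed):

```python
# Sketch: why new vocabulary breaks a fitted model's input shape.
from sklearn.feature_extraction.text import TfidfVectorizer

old_docs = ["cat dog", "dog bird"]
new_docs = old_docs + ["zebra"]  # the stream brings a never-seen word

X_old = TfidfVectorizer().fit_transform(old_docs)
X_new = TfidfVectorizer().fit_transform(new_docs)

print(X_old.shape)  # (2, 3) -- vocabulary: bird, cat, dog
print(X_new.shape)  # (3, 4) -- "zebra" adds a column, so a classifier
                    # trained on 3 features can no longer accept this input
```

Any estimator fitted on `X_old` expects exactly 3 columns, which is why the whole pipeline has to be refit once the vocabulary grows.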
- So even if you can save some time using the TfidfVectorizer patch from the question linked by @Metropolis, you will still need to retrain your model on all the data due to the increased number of columns. – Vivek Kumar Mar 07 '18 at 05:09
- Thanks for helping me out. Generally speaking, my situation is this: I have a bunch of data that I first train my model on (not big at all). Later, new data gets generated; it could be X's, Y's, or both. When new data comes in, should I retrain from scratch (i.e. pass all previous and present data through the TfidfVectorizer and then a Naive Bayes model), or should I persist the models in some form and do incremental training with partial_fit()? I really don't know how to go about this, please help. – Shiva Kumar Mar 07 '18 at 14:48
- Maybe one can use HashingVectorizer() here, though I'm not sure. Does this mean that I first call fit() on my docs using HashingVectorizer(), then pass the result to Naive Bayes and call fit(), and for future data call partial_fit() on both for that chunk? The sklearn docs (linked in the question) seem to only call transform on the text. What do I do? – Shiva Kumar Mar 07 '18 at 21:18
- HashingVectorizer is a good alternative. It will not change the feature size on new data and hence can be used with partial_fit on the Naive Bayes estimator. – Vivek Kumar Mar 08 '18 at 07:17
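A minimal sketch of that setup, assuming scikit-learn >= 0.19 (for HashingVectorizer's `alternate_sign` parameter, which has to be disabled so MultinomialNB receives non-negative features); the documents and labels below are made-up placeholders:

```python
# Incremental training: stateless HashingVectorizer + partial_fit.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fixed output dimension: 2**18 columns regardless of vocabulary,
# so the feature shape never changes as new words appear.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

clf = MultinomialNB()
classes = [0, 1]  # all labels must be declared on the first partial_fit call

# First batch -- HashingVectorizer needs no fit, only transform.
X1 = vectorizer.transform(["spam spam spam", "hello friend"])
clf.partial_fit(X1, [1, 0], classes=classes)

# Later batch with words never seen before: no retraining needed, because
# hashing maps any token into the same fixed-size feature space.
X2 = vectorizer.transform(["brand new vocabulary here"])
clf.partial_fit(X2, [0])

# Predict on fresh text the same way.
pred = clf.predict(vectorizer.transform(["spam spam"]))
```

For the multi-label case in the question, MultinomialNB would be wrapped in a multi-output strategy that supports partial_fit, as the next comment describes.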
- Using a HashingVectorizer with a multi-output classifier and partial_fit() works for my case. But I still can't wrap my head around what exactly the HashingVectorizer does. I know it doesn't learn the vocabulary, so what does it do when new data comes in? Does it learn something from it? I may be thinking it takes the unseen words, applies a hash function and spits out a number, but I don't see how that helps the classification model. – Shiva Kumar Mar 09 '18 at 21:29
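What the hashing trick does can be illustrated with a toy pure-Python version (this is not scikit-learn's actual implementation, which uses MurmurHash3, but the idea is the same): the hash itself is not learned, it just deterministically maps any token, seen or unseen, to a column index in a fixed-size vector.

```python
# Toy illustration of the hashing trick.
import hashlib

N_FEATURES = 16  # tiny on purpose; sklearn defaults to 2**20

def token_index(token):
    # Stable hash -> bucket in [0, N_FEATURES). Collisions are possible;
    # colliding tokens simply share a column and their counts add up.
    h = int(hashlib.md5(token.encode()).hexdigest(), 16)
    return h % N_FEATURES

def vectorize(doc):
    # No vocabulary, no fit: unseen words get a column like any other.
    vec = [0] * N_FEATURES
    for tok in doc.split():
        vec[token_index(tok)] += 1
    return vec

print(vectorize("new unseen words"))  # a 16-dim count vector
```

The classifier then learns weights per column, so as long as a word hashes to the same column at train and predict time (which it always does), the model can exploit it without ever storing a vocabulary.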