No, this is not possible at present. It's also not "doable", and here's why.
`CountVectorizer` and `TfidfVectorizer` are designed to turn text documents into vectors. These vectors all need to have an equal number of elements, which in turn is equal to the size of the vocabulary, because that convention is ingrained in all scikit-learn code. If the vocabulary is allowed to grow, then the vectors produced at various times have different lengths. This affects, e.g., the number of parameters in a linear (or other parametric) classifier trained on such vectors, which then also needs to be able to grow. It affects k-means and the dimensionality reduction classes. It even affects something as simple as matrix multiplication, which can no longer be handled with a simple call to NumPy's `dot` routine and would require custom code instead. In other words, allowing this flexibility in the vectorizers makes little sense unless you adapt all of scikit-learn to handle the result.
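To make the fixed-width constraint concrete, here's a minimal sketch (the toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X = vect.fit_transform(["the cat sat", "the dog barked"])
print(X.shape)  # (2, 5): five distinct terms in the fitted vocabulary

# Transforming new documents keeps the same width: unseen terms
# like "parrot" are silently dropped, not added to the vocabulary.
X_new = vect.transform(["the parrot squawked"])
print(X_new.shape)  # (1, 5)
```

Everything downstream of the vectorizer relies on that second dimension staying constant.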
While this would be possible, I (as a core scikit-learn developer) would strongly oppose the change, because it would make the code very complicated and probably slower, and even if it worked, it would make it impossible to distinguish between a "growing vocabulary" and the much more common situation of a user passing data in the wrong way, so that the number of dimensions comes out wrong.
If you want to feed data in batches, then either use a `HashingVectorizer` (which has no vocabulary) or do two passes over the data to collect the vocabulary up front.
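As a rough sketch of the first option (the batch data here is invented for illustration), a `HashingVectorizer` can be paired with an estimator that supports `partial_fit`, such as `SGDClassifier`:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer maps tokens to column indices with a hash
# function, so there is no vocabulary to grow: every batch comes
# out with the same, fixed number of columns.
vect = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier()

# Toy batches of (texts, labels); in practice these would be
# read incrementally from disk or a stream.
batches = [
    (["good film", "great acting"], [1, 1]),
    (["terrible plot", "awful pacing"], [0, 0]),
]
for texts, labels in batches:
    X = vect.transform(texts)
    # partial_fit requires the full set of classes on the first call
    clf.partial_fit(X, labels, classes=[0, 1])
```

The two-pass alternative is essentially the same loop, preceded by fitting a `CountVectorizer` or `TfidfVectorizer` on the full corpus so the vocabulary is frozen before any `transform` calls.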