5

Scikit-learn's CountVectorizer for the bag-of-words approach currently offers two sub-options: (a) use a custom vocabulary, or (b) if no custom vocabulary is supplied, build a vocabulary from all the words present in the corpus.
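A minimal sketch of the two modes (the corpus and vocabulary here are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat on the mat"]

# (a) fixed, user-supplied vocabulary: transform() works directly,
# and the vector width is always len(vocabulary)
cv_fixed = CountVectorizer(vocabulary=["cat", "dog", "mat"])
X_fixed = cv_fixed.transform(corpus)
print(X_fixed.shape)  # (2, 3)

# (b) vocabulary learned from the corpus itself during fit
cv_learned = CountVectorizer()
X_learned = cv_learned.fit_transform(corpus)
print(sorted(cv_learned.vocabulary_))  # all distinct tokens in the corpus
```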

My question: can we specify a custom vocabulary to begin with, but ensure that it gets updated when new words are encountered while processing the corpus? I am assuming this is doable, since the matrix is stored in a sparse representation.

Usefulness: it would help in cases where one has to add documents to the training data and should not have to rebuild everything from scratch.

  • This can't be done with scikit-learn as it is written now, so the only option I see is to file an enhancement request on the [issue tracker](https://github.com/scikit-learn/scikit-learn/issues). – alko Dec 10 '13 at 13:51

1 Answer

2

No, this is not possible at present. It's also not "doable", and here's why.

CountVectorizer and TfidfVectorizer are designed to turn text documents into vectors. These vectors must all have the same number of elements, which in turn equals the size of the vocabulary, because that convention is ingrained in all scikit-learn code. If the vocabulary is allowed to grow, then vectors produced at different times have different lengths. This affects, e.g., the number of parameters in a linear (or other parametric) classifier trained on such vectors, which would then also need to be able to grow. It affects k-means and the dimensionality reduction classes. It even affects something as simple as matrix multiplication, which can no longer be handled with a simple call to NumPy's dot routine and would require custom code instead. In other words, allowing this flexibility in the vectorizers makes little sense unless you adapt all of scikit-learn to handle the result.
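To make the fixed-length requirement concrete, here is a small demonstration (with toy data) of what would break downstream if the vector length grew after training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A classifier fit on 3-dimensional vectors (vocabulary size 3)...
clf = LogisticRegression()
clf.fit(np.array([[1, 0, 2], [0, 1, 0]]), [0, 1])

# ...cannot handle 4-dimensional vectors, which is exactly what a
# grown vocabulary would produce at transform time.
try:
    clf.predict(np.array([[1, 0, 2, 1]]))
except ValueError as e:
    print("shape mismatch:", e)
```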

While this would be possible, I (as a core scikit-learn developer) would strongly oppose the change: it would make the code very complicated and probably slower, and even if it worked, it would make it impossible to distinguish between a "growing vocabulary" and the much more common situation of a user passing data in the wrong way, so that the number of dimensions comes out wrong.

If you want to feed data in batches, then either use a HashingVectorizer (no vocabulary) or do two passes over the data to collect the vocabulary up front.
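Both workarounds can be sketched like this (the batches are illustrative, and the whitespace split in the second option is a simplification of the vectorizer's real tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

batch1 = ["the cat sat"]
batch2 = ["the dog sat on the mat"]

# Option 1: HashingVectorizer is stateless with a fixed output width,
# so batches vectorized at different times are always compatible.
hv = HashingVectorizer(n_features=2 ** 10)
X1 = hv.transform(batch1)
X2 = hv.transform(batch2)
assert X1.shape[1] == X2.shape[1]

# Option 2: first pass collects the full vocabulary, second pass vectorizes.
vocab = sorted({tok for doc in batch1 + batch2 for tok in doc.split()})
cv = CountVectorizer(vocabulary=vocab)
X = cv.transform(batch1 + batch2)
print(X.shape)  # (2, 6)
```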

Fred Foo
  • Older vectors can be updated by extending with zeros for the new terms that didn't occur in them. – Cory Mar 15 '17 at 23:02
  • I would find the vocabulary management and tokenization features of the `CountVectorizer` and `TfidfVectorizer` useful on their own; it would be nice if the vectorization were separate. – Cory Mar 15 '17 at 23:03
  • 1
    I think this answer comes down too strongly on the side of "not doable". In essence, the vectorizers in scikit-learn do grow in size as the corpus is processed. That is, if no pre-built vocabulary is given, they start at size 0 and grow to size N as the corpus is processed. Starting with a vocabulary of size M should be possible. – robguinness Oct 30 '18 at 09:49
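The zero-padding idea from the comments can be sketched as follows, assuming scipy.sparse is available (the vocabularies and document are made up for illustration):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# A matrix built with the old 3-term vocabulary
old_cv = CountVectorizer(vocabulary=["cat", "dog", "mat"])
X_old = old_cv.transform(["the cat sat on the mat"])

# Two new terms appear later; pad the old matrix with zero columns so it
# matches the new 5-term vocabulary (old terms keep their positions).
n_new_terms = 2
X_padded = sp.hstack(
    [X_old, sp.csr_matrix((X_old.shape[0], n_new_terms))], format="csr"
)
print(X_padded.shape)  # (1, 5)
```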