
I have been working with the CountVectorizer class in scikit-learn.

I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer=tokenize)  # tokenize is a custom function that splits each string on ", "
data = vec.fit_transform(tags).toarray()
print(data)

Where we get

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts of another set of documents, say,

list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

And get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.

Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.

tumultous_rooster

3 Answers


You're right that vocabulary is what you want. It works like this:

>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it a list (or a dict mapping terms to column indices) containing your desired features.

If you used CountVectorizer on one set of documents and then want to use the set of features from those documents for a new set, take the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from your first one.
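Putting that approach together as a runnable sketch, assuming the question's data and a comma-splitting tokenizer (the question's `tokenize` function is not shown, so the one below is a guess that reproduces the question's output):

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_docs = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

def tokenize(s):
    # treat each comma-separated keyword as one token
    return s.split(", ")

# learn the vocabulary from the original tags
vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# reuse that vocabulary in a fresh vectorizer for the new documents
new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
counts = new_vec.transform(new_docs).toarray()
print(counts)
# [[0 0 0 1 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 0 0]]
```

Because `new_vec` is constructed with a fixed vocabulary, it can call transform() directly without being fitted first.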

BrenBarn
  • Thanks, this looks great! For the first solution: should the vocabulary always be a dict, not list? Correct me if I'm wrong, but the counts (0, 1, 2) seem irrelevant. The second method you outlined looks perhaps a little clearer. – tumultous_rooster Apr 07 '14 at 19:20
  • 1
    @MattO'Brien: You're right, it can be a list, I misread the documentation. I edited my answer. In the second method, though, it is a dict, because that's what the `vocabulary_` method of a fitted vectorizer is. – BrenBarn Apr 07 '14 at 19:25
  • 1
    BrenBarn, your answer saved me a lot of time. Seriously. Thanks for being on this site. – tumultous_rooster Apr 07 '14 at 20:49
  • 10
    Maybe I'm not understanding something, but rather than initializing a new `CountVectorizer` with the original vocabulary, could you not just call `.transform()` on the new document set with the original vectorizer? – Fred Jun 30 '15 at 03:52
  • I have a large list (n > 10000) of strings, each containing 100K to 110K words. How can I make CountVectorizer fast for such data? Does it use all cores? – Asis Jan 21 '20 at 18:22
  • for anybody having problem with the `vocabulary` param, `CountVectorizer()` has a `lowercase=True` default setting so you have to set it to `lowercase=False` if your custom vocabulary list is uppercase. – NatalieL Sep 29 '21 at 21:57

You should call fit_transform (or just fit) on your original vocabulary source so that the vectorizer learns a vocabulary.

Then you can use this fitted vectorizer on any new data source via the transform() method.

You can obtain the vocabulary produced by the fit (i.e. the mapping of word to token ID) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).
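For instance, a sketch using the question's data with the default word analyzer:

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

vectorizer = CountVectorizer()
vectorizer.fit(tags)  # learns the vocabulary from the original source

# the learned word -> column-index mapping
print(vectorizer.vocabulary_)

# the same fitted vectorizer counts tokens in any new documents
new_counts = vectorizer.transform(["python, chicken", "linux, cow, ubuntu"])
print(new_counts.toarray())
```

Words absent from the learned vocabulary ("chicken", "cow") are simply ignored, which is exactly the behavior the question asks for.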

Dhruv Ghulati
>>> tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

>>> list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"],
]

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` should be a flat list of strings,
# so flatten the nested one-element lists first; `itertools.chain` handles
# shallow nesting neatly

>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer is using the vocabulary learned from tags on new_docs, print vect.vocabulary_ again, or compare the output of new_docs.toarray() to that of tags.toarray(): matching columns refer to the same words.
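A self-contained sketch of that check: the column index for a given word comes from the same vocabulary_ mapping for both matrices, so the counts line up:

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
vect = CountVectorizer()
old_counts = vect.fit_transform(tags).toarray()
new_counts = vect.transform(["python, chicken"]).toarray()

# "python" occupies the same column in both the old and new count matrices
col = vect.vocabulary_["python"]
print(old_counts[:, col])  # [1 0 0]
print(new_counts[:, col])  # [1]
```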

user2476665