Text classification + Bag of words + Python : Bag of words doesn't show document index

Question

I have written the following code to produce bag of words:

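from sklearn.feature_extraction.text import CountVectorizer

# data is assumed to be a pandas DataFrame with a 'description' text column
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(data['description'].values.astype('U'))
vocab = count_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2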
print(type(final_counts)) #final_counts is a sparse matrix
print("--------------------------------------------------------------")
print(final_counts.shape)
print("--------------------------------------------------------------")
print(final_counts.toarray())
print("--------------------------------------------------------------")
print(final_counts[769].shape)
print("--------------------------------------------------------------")
print(final_counts[769])
print("--------------------------------------------------------------")
print(final_counts[769].toarray())
print("--------------------------------------------------------------")
print(len(vocab))
print("--------------------------------------------------------------")

I am getting following output:

<class 'scipy.sparse.csr.csr_matrix'>
--------------------------------------------------------------
(770, 10252)
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
--------------------------------------------------------------
(1, 10252)
--------------------------------------------------------------
  (0, 4819) 1
  (0, 2758) 1
  (0, 3854) 2
  (0, 3987) 1
  (0, 1188) 1
  (0, 3233) 1
  (0, 981)  1
  (0, 10065)    1
  (0, 9811) 1
  (0, 8932) 1
  (0, 9599) 1
  (0, 10150)    1
  (0, 7716) 1
  (0, 10045)    1
  (0, 5783) 1
  (0, 5500) 1
  (0, 5455) 1
  (0, 3234) 1
  (0, 7107) 1
  (0, 6504) 1
  (0, 3235) 1
  (0, 1625) 1
  (0, 3591) 1
  (0, 6525) 1
  (0, 365)  1
  : :
  (0, 5527) 1
  (0, 9972) 1
  (0, 4526) 3
  (0, 3592) 4
  (0, 10214)    1
  (0, 895)  1
  (0, 10062)    2
  (0, 10210)    1
  (0, 1246) 1
  (0, 9224) 2
  (0, 4924) 1
  (0, 6336) 2
  (0, 9180) 8
  (0, 6366) 2
  (0, 414)  12
  (0, 1307) 1
  (0, 9309) 1
  (0, 9177) 1
  (0, 3166) 1
  (0, 396)  1
  (0, 9303) 7
  (0, 320)  5
  (0, 4782) 2
  (0, 10088)    3
  (0, 4481) 3
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]]
--------------------------------------------------------------
10252
--------------------------------------------------------------

It's clear that there are 770 documents and 10,252 unique words in the corpus. What confuses me is why the line print(final_counts[769]) in my code prints this:

(0, 4819) 1
  (0, 2758) 1
  (0, 3854) 2
  (0, 3987) 1
  (0, 1188) 1
  (0, 3233) 1
  (0, 981)  1
  (0, 10065)    1
  (0, 9811) 1
  (0, 8932) 1
  (0, 9599) 1
  (0, 10150)    1
  (0, 7716) 1
  (0, 10045)    1
  (0, 5783) 1
  (0, 5500) 1
  (0, 5455) 1
  (0, 3234) 1
  (0, 7107) 1
  (0, 6504) 1
  (0, 3235) 1
  (0, 1625) 1
  (0, 3591) 1
  (0, 6525) 1
  (0, 365)  1
  : :
  (0, 5527) 1
  (0, 9972) 1
  (0, 4526) 3
  (0, 3592) 4
  (0, 10214)    1
  (0, 895)  1
  (0, 10062)    2
  (0, 10210)    1
  (0, 1246) 1
  (0, 9224) 2
  (0, 4924) 1
  (0, 6336) 2
  (0, 9180) 8
  (0, 6366) 2
  (0, 414)  12
  (0, 1307) 1
  (0, 9309) 1
  (0, 9177) 1
  (0, 3166) 1
  (0, 396)  1
  (0, 9303) 7
  (0, 320)  5
  (0, 4782) 2
  (0, 10088)    3
  (0, 4481) 3

The first index is the document index. I am printing the vector of the document at index 769 (counting from 0), so I expected the first index to be 769 instead of 0, e.g. (769, 4819) 1. Why isn't it?

Is this not expected behaviour? Given that `final_counts[769]` has shape `(1, 10252)`, it can't have an index greater than `0` in the first axis. — piman314, Sep 25 '18 at 10:53
That is absolutely correct, but after going through some documents online, I found that the vector for a document is represented in this manner: <(document index, word index in corpus) count of that word in that document>. — Debbie, Sep 25 '18 at 10:57

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

As explained here this happens because it is a sparse matrix.

If you have 100 documents with 964 features in a vectorizer:

vectorizer = CountVectorizer()
transformed = vectorizer.fit_transform(documents)
>>> transformed
<100x964 sparse matrix of type '<class 'numpy.int64'>'
    with 3831 stored elements in Compressed Sparse Row format>

If you print the whole matrix, you get the coordinates of the non-zero elements across all documents. This is the format you describe:

<(document index, word index in corpus) count of that word in that document>

>>> print(transformed)
  (0, 30)   1
  (0, 534)  1
  (0, 28)   1
  (0, 232)  2
  (0, 298)  1
  (0, 800)  1
  (0, 126)  1
  : :
  (98, 467) 8
  (98, 461) 63
  (98, 382) 88
  (98, 634) 4
  (98, 15)  1
  (98, 450) 1139
  (99, 441) 1940

For example, print(transformed[(99, 441)]) prints 1940.
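
To make the triples concrete, here is a minimal sketch on a tiny made-up corpus (the three sentences are my own illustration, not the asker's data). It lists the same (document index, word index) count entries by converting to COO format, then looks up a single element by its coordinates:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, for illustration only.
documents = ["the cat sat", "the dog sat on the mat", "cat and dog"]

vectorizer = CountVectorizer()
transformed = vectorizer.fit_transform(documents)

# COO format exposes the (row, column, value) triples that print() displays.
coo = transformed.tocoo()
triples = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for doc_idx, word_idx, count in triples:
    print((doc_idx, word_idx), count)

# Element lookup by (document index, word index).
# "the" occurs twice in the document at index 1.
the_idx = vectorizer.vocabulary_["the"]
print(transformed[1, the_idx])
```

On the full matrix the first coordinate runs over all document indices, which is why it is only there that you see values other than 0.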

When you call print(transformed[0]), you get the following:

  (0, 30)   1
  (0, 534)  1
  (0, 28)   1
  (0, 232)  2
  (0, 298)  1
  (0, 800)  1
  : :
  (0, 683)  12
  (0, 15)   1
  (0, 386)  1
  (0, 255)  1
  (0, 397)  1
  (0, 450)  10
  (0, 682)  2782

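This is because transformed[0] is itself a sparse matrix with one row and 32 non-zero elements, which are the ones printed above. The row indices in the printout are relative to this one-row matrix, so they are always 0, whichever document you sliced out: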

>>> transformed[0] 
<1x964 sparse matrix of type '<class 'numpy.int64'>'
with 32 stored elements in Compressed Sparse Row format>

You can access its elements with these tuples, e.g. transformed[0][(0, 682)] returns 2782.

(Note that transformed[0].toarray().shape is (1, 964) not (964,))
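
If you want the original document index (and the words) next to the counts, one option is to attach them yourself after slicing. This is only a sketch on a made-up corpus: doc_idx, index_to_word and labelled are names I introduced for illustration, and it assumes fit_transform returns a CSR matrix, as in the output above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, for illustration only.
documents = ["the cat sat", "the dog sat on the mat", "cat and dog"]

vectorizer = CountVectorizer()
transformed = vectorizer.fit_transform(documents)

# Invert the fitted vocabulary (word -> column) into column -> word.
index_to_word = {int(idx): word for word, idx in vectorizer.vocabulary_.items()}

doc_idx = 2                  # the document we slice out
row = transformed[doc_idx]   # 1 x n_features sparse matrix; its only row is row 0

# row.indices holds the column indices of the stored entries, row.data their counts.
# Pairing them with doc_idx restores the index that slicing dropped.
labelled = [(doc_idx, int(col), index_to_word[int(col)], int(count))
            for col, count in zip(row.indices, row.data)]
for entry in labelled:
    print(entry)
```

The stored entries are not guaranteed to come out sorted by column index, so sort labelled if you need a stable order.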
