I have written the following code to produce a bag-of-words representation:
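from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# data is a pandas DataFrame with a 'description' column; cast to unicode strings
final_counts = count_vect.fit_transform(data['description'].values.astype('U'))
vocab = count_vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0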
print(type(final_counts)) #final_counts is a sparse matrix
print("--------------------------------------------------------------")
print(final_counts.shape)
print("--------------------------------------------------------------")
print(final_counts.toarray())
print("--------------------------------------------------------------")
print(final_counts[769].shape)
print("--------------------------------------------------------------")
print(final_counts[769])
print("--------------------------------------------------------------")
print(final_counts[769].toarray())
print("--------------------------------------------------------------")
print(len(vocab))
print("--------------------------------------------------------------")
I am getting the following output:
<class 'scipy.sparse.csr.csr_matrix'>
--------------------------------------------------------------
(770, 10252)
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
--------------------------------------------------------------
(1, 10252)
--------------------------------------------------------------
(0, 4819) 1
(0, 2758) 1
(0, 3854) 2
(0, 3987) 1
(0, 1188) 1
(0, 3233) 1
(0, 981) 1
(0, 10065) 1
(0, 9811) 1
(0, 8932) 1
(0, 9599) 1
(0, 10150) 1
(0, 7716) 1
(0, 10045) 1
(0, 5783) 1
(0, 5500) 1
(0, 5455) 1
(0, 3234) 1
(0, 7107) 1
(0, 6504) 1
(0, 3235) 1
(0, 1625) 1
(0, 3591) 1
(0, 6525) 1
(0, 365) 1
: :
(0, 5527) 1
(0, 9972) 1
(0, 4526) 3
(0, 3592) 4
(0, 10214) 1
(0, 895) 1
(0, 10062) 2
(0, 10210) 1
(0, 1246) 1
(0, 9224) 2
(0, 4924) 1
(0, 6336) 2
(0, 9180) 8
(0, 6366) 2
(0, 414) 12
(0, 1307) 1
(0, 9309) 1
(0, 9177) 1
(0, 3166) 1
(0, 396) 1
(0, 9303) 7
(0, 320) 5
(0, 4782) 2
(0, 10088) 3
(0, 4481) 3
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]]
--------------------------------------------------------------
10252
--------------------------------------------------------------
It's clear that there are 770 documents and 10,252 unique words in the corpus. What confuses me is why the line print(final_counts[769]) in my code prints entries whose first index is 0 (the same output already shown above):
(0, 4819) 1
(0, 2758) 1
(0, 3854) 2
: :
(0, 10088) 3
(0, 4481) 3
As I understand it, the first index is the document index. I am printing the vector of the document at index 769 (counting from 0), so I expected the first index to be 769 instead of 0, for example (769, 4819) 1. Why isn't it so?
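
For reference, here is a minimal sketch that reproduces the same behavior on a toy matrix. The 3 x 4 array and the names dense, counts, row, and doc_index are made up for illustration and are not my real data. It suggests that slicing out one row yields a new 1 x N matrix with its own row numbering starting at 0, so the original document index would have to be carried along separately.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for final_counts: 3 "documents", 4 "vocabulary terms" (made-up values)
dense = np.array([[0, 1, 0, 2],
                  [0, 0, 0, 0],
                  [3, 0, 4, 0]])
counts = csr_matrix(dense)

doc_index = 2
row = counts[doc_index]      # row slice: a new sparse matrix of shape (1, 4)
rows, cols = row.nonzero()   # indices are relative to the slice, not to counts

print(row.shape)             # shape of the slice
print(rows)                  # row indices inside the slice
print(cols)                  # column (vocabulary) indices with non-zero counts

# Re-attach the original document index by hand
pairs = [(doc_index, int(c), int(row[0, c])) for c in cols]
print(pairs)
```

Under these assumptions, the row indices printed for the slice are all 0, while pairs carries the original index 2 alongside each column index and count.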