
I'm using the HashingVectorizer class from sklearn.feature_extraction.text, but I do not understand how it works.

My code:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = HashingVectorizer(n_features=2**3)
X = vectorizer.fit_transform(corpus)
print(X)

My result:

(0, 0)        -0.8944271909999159
(0, 5)        0.4472135954999579
(0, 6)        0.0
(1, 0)        -0.8164965809277261
(1, 3)        0.4082482904638631
(1, 5)        0.4082482904638631
(1, 6)        0.0
(2, 4)        -0.7071067811865475
(2, 5)        0.7071067811865475
(2, 6)        0.0
(3, 0)        -0.8944271909999159
(3, 5)        0.4472135954999579
(3, 6)        0.0

I read a lot of papers on the hashing trick, like this article: https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f

I understand the article, but I do not see its relationship to the result obtained above.

Can you explain to me, with a simple example, how HashingVectorizer works, please?

Toni Garcia

2 Answers


I think the results do not make sense to you because of the negative values and the normalization that HashingVectorizer applies by default.

If you do this:

vectorizer = HashingVectorizer(n_features=2**3, norm=None, alternate_sign=False)

You should see the raw counts, and the results should start making sense. If you want normalized term frequencies instead, set norm='l2'.
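A quick way to check that claim, using the corpus from the question (a sketch; with norm=None and alternate_sign=False, each row should just count tokens per bucket):

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Raw counts: no normalization, no sign flipping
vectorizer = HashingVectorizer(n_features=2**3, norm=None, alternate_sign=False)
X = vectorizer.fit_transform(corpus)

# All entries are now non-negative, and each row sums to the
# number of tokens in that document (5, 6, 6, 5 here)
print(X.toarray())
print(X.toarray().sum(axis=1))
```

Once the raw counts make sense, you can add norm='l2' back and see that each row is simply rescaled to unit length.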

The output you are printing is the sparse-matrix view: each line is a (document_id, column_index) pair followed by the value stored at that position.
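You can make those pairs explicit by walking the COO (coordinate) form of the matrix; this is a sketch built on the code from the question:

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

X = HashingVectorizer(n_features=2**3).fit_transform(corpus)

# COO format exposes the (row, column, value) triplets behind the default print
coo = X.tocoo()
for doc_id, col, value in zip(coo.row, coo.col, coo.data):
    print(f"document {doc_id}, column {col}: {value}")
```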

For more information, see this article on HashingVectorizer vs. CountVectorizer.

dolly

The result is a sparse representation of a 4x8 matrix (4 documents, n_features=2**3 = 8 columns).

print(X.toarray())

Output:

[[-0.89442719  0.          0.          0.          0.          0.4472136
   0.          0.        ]
 [-0.81649658  0.          0.          0.40824829  0.          0.40824829
   0.          0.        ]
 [ 0.          0.          0.          0.         -0.70710678  0.70710678
   0.          0.        ]
 [-0.89442719  0.          0.          0.          0.          0.4472136
   0.          0.        ]]

To find the column a token lands in, we compute its hash and take it modulo n_features; that column is the token's bucket, and each document row accumulates a count per bucket. With the default alternate_sign=True, the hash also determines a sign, which is why some values are negative and colliding tokens can cancel out to 0.0.
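scikit-learn's hashing is based on MurmurHash3, and the token-to-column mapping can be reproduced with the murmurhash3_32 helper from sklearn.utils.murmurhash. That helper is an internal implementation detail rather than public API, so treat this as an illustration of the idea, not a stable recipe:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.utils.murmurhash import murmurhash3_32  # internal sklearn helper

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

n_features = 2 ** 3
vectorizer = HashingVectorizer(n_features=n_features, norm=None,
                               alternate_sign=False)
analyzer = vectorizer.build_analyzer()  # same lowercasing + tokenization

# Rebuild the count matrix by hashing each token to a column ourselves
manual = np.zeros((len(corpus), n_features))
for row, doc in enumerate(corpus):
    for token in analyzer(doc):
        h = murmurhash3_32(token, seed=0)      # signed 32-bit hash
        manual[row, abs(h) % n_features] += 1  # hash -> column index

print(manual)
# Compare with what HashingVectorizer produces
print(np.array_equal(manual, vectorizer.fit_transform(corpus).toarray()))
```

Note that the mapping is one-way: several tokens can hash to the same column, and there is no way to recover the token from the column index, which is why HashingVectorizer has no get_feature_names_out.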