
Wanting to understand "the hashing trick" I've written the following test code:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h = FeatureHasher(n_features=4, input_type='string')
f = h.transform(test.type)
print(f.toarray())

In the above example, I'm mapping 8 categories into 4 columns, and the output is:

[[ 0.  0.  1.  0.]<-a
 [ 0. -1.  0.  0.]<-b
 [ 0. -1.  0.  0.]<-c
 [ 0.  0.  0.  1.]<-d
 [ 0.  0.  0.  1.]<-e
 [ 0.  0.  0.  1.]<-f
 [ 0.  0. -1.  0.]<-g
 [ 0. -1.  0.  0.]]<-h

In the resulting matrix I can see repetitions: some categories are represented in exactly the same way. Why is that? 8 categories can be mapped into 4 columns if I use a binary representation.

Can someone please explain the output of this technique and maybe elaborate a bit?


1 Answer


A FeatureHasher will lead to undesired results if you set n_features to such a low value. The reason lies in the way it maps categories to column indices.

As opposed to a CountVectorizer, for instance, where each distinct category simply gets a column of its own, FeatureHasher applies a hash function to the features and derives each category's column index from the resulting hash. Its main advantages are hence speed and low memory usage, since no vocabulary has to be built or stored. However, the column index is the hash value reduced modulo n_features, so by limiting n_features to such a low value you force 8 distinct categories into only 4 available columns: several of them inevitably collide, and colliding categories end up with identical rows in the output.
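
To make the collisions concrete, here is a small check, reusing test and the hashed matrix f from the question (the grouping logic is only for illustration), that lists which categories ended up with identical rows:

import numpy as np

# Categories that share a hashed row have collided into the same column
# and are indistinguishable from that point on.
arr = f.toarray()
for row in np.unique(arr, axis=0):
    members = [cat for cat, r in zip(test['type'], arr) if np.array_equal(r, row)]
    print(row, '<-', members)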


We can check this by reproducing the hashing that is done internally in _hashing_fast, which uses murmurhash3_bytes_s32 to generate the hashes:

from sklearn.utils.murmurhash import murmurhash3_bytes_s32

# Rebuild the (feature, value) pairs the same way FeatureHasher.transform
# does for input_type='string', then hash each feature string.
raw_X = test['type']
raw_X = iter(raw_X)
raw_X = (((f, 1) for f in x) for x in raw_X)

for x in raw_X:
    for f, v in x:
        f = f'{f}={v}'                         # feature string to be hashed
        fb = f.encode("utf-8")
        h = murmurhash3_bytes_s32(fb, seed=0)  # signed 32-bit Murmurhash3
        print(f'{f[0]} -> {h}')

This prints the raw signed 32-bit hash of each category. FeatureHasher then folds every hash into the available columns by taking it modulo n_features, and the sign of the stored value follows the sign of the hash (alternate_sign=True by default, which is where the -1 entries come from). With 8 distinct hashes squeezed into only 4 columns, several categories are bound to land in the same column, and that is the repetition you see in the output:

a -> -424864564
b -> -992685778
c -> -1984769100
d -> 728527081
e -> 2077529484
f -> 2074045163
g -> -1877798433
h -> -51608576
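
As a sketch of that folding step (assuming the usual hashing-trick reduction of the hash modulo n_features with the sign taken from the hash; note the exact byte string hashed internally may differ from the '<feature>=<value>' form above, depending on the value's type and the scikit-learn version):

from sklearn.utils.murmurhash import murmurhash3_bytes_s32

n_features = 4

# Fold a raw signed 32-bit hash into one of n_features columns: the column
# is the absolute hash modulo n_features, and with alternate_sign=True the
# stored value takes the sign of the hash.
for cat in test['type']:
    h = murmurhash3_bytes_s32(cat.encode('utf-8'), seed=0)
    col = abs(h) % n_features
    sign = 1 if h >= 0 else -1
    print(f'{cat}: column {col}, value {sign:+d}')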
  • Thanks for your reply, I thought the whole purpose of FeatureHasher is to generate a lower dimension vector for a high cardinality categories. Also the n_features in the official example (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) is 10. What am I missing? – Roni Gadot Apr 19 '20 at 11:17
  • No the default `n_features` is *very* high, precisely `1048576`. I've updated with more details @RoniGadot – yatu Apr 19 '20 at 11:30
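
For reference, a quick sketch (reusing the question's test frame; the value 1024 below is an arbitrary choice, the default is far larger still) shows that the repetitions disappear once there are many more columns than categories:

from sklearn.feature_extraction import FeatureHasher

# With many more columns than categories, hash collisions become unlikely
# (though never impossible), so the eight categories will almost certainly
# land in eight different columns.
h_big = FeatureHasher(n_features=1024, input_type='string')
f_big = h_big.transform(test['type'])
_, cols = f_big.nonzero()
for cat, col in zip(test['type'], cols):
    print(f'{cat} -> column {col}')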