To understand "the hashing trick", I wrote the following test code:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h = FeatureHasher(n_features=4, input_type='string')
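# input_type='string' expects an iterable of iterables of strings;
# newer scikit-learn versions reject bare strings, so wrap each value in a list
f = h.transform([[t] for t in test.type])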
print(f.toarray())
In the above example, I'm mapping 8 categories into 4 columns, and the output is:
[[ 0. 0. 1. 0.]<-a
[ 0. -1. 0. 0.]<-b
[ 0. -1. 0. 0.]<-c
[ 0. 0. 0. 1.]<-d
[ 0. 0. 0. 1.]<-e
[ 0. 0. 0. 1.]<-f
[ 0. 0. -1. 0.]<-g
[ 0. -1. 0. 0.]]<-h
In the resulting matrix, several categories get identical rows (for example, d, e and f are all represented the same way). Why is that? 8 categories can each be given a unique code in 4 columns if I use a binary representation.
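To make clear what I mean by a binary representation, here is a minimal sketch. Note that `binary_encode` is a hypothetical helper I made up for illustration, not something from scikit-learn, and it assumes the full set of categories is known up front so each one can be given an index:

```python
# Hypothetical helper (not part of scikit-learn): write each category's
# index as fixed-width binary digits, one digit per output column.
def binary_encode(categories, n_columns):
    # Assumption: all categories are known in advance, so each gets a unique index.
    index = {c: i for i, c in enumerate(sorted(set(categories)))}
    if len(index) > 2 ** n_columns:
        raise ValueError("not enough columns for a unique code per category")
    # Most significant bit first.
    return [[(index[c] >> bit) & 1 for bit in reversed(range(n_columns))]
            for c in categories]

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
codes = binary_encode(labels, n_columns=4)
for label, row in zip(labels, codes):
    print(row, '<-', label)
```

With this scheme every one of the 8 categories gets a distinct row, which is what I expected to see from FeatureHasher as well.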
Can someone please explain the output of this technique and maybe elaborate a bit?