
I am using feature hashing to convert string variables to hashed indices for classification. After some digging I noticed that although both R and Python offer MurmurHash3-based feature hashing (R: FeatureHashing::hashed.model.matrix and Python: sklearn.feature_extraction.FeatureHasher), the two implementations place the features in different columns. I thought MurmurHash was supposed to be deterministic: run the same operation on the same input and you get the same hash. Perhaps the two implementations use different seeds? This matters to me because my classification model (xgboost, which I realize has its own discrepancies between R and Python, as others have pointed out) may produce different results on the same data. I seem to have figured the xgboost part out, though.
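For what it's worth, a quick way to see how much the seed matters is the standalone mmh3 package (this is just my illustration; I'm not claiming either library calls mmh3 like this):

import mmh3  # pip install mmh3

# MurmurHash3 is deterministic for a fixed (input, seed) pair, but two
# implementations that disagree on the seed produce unrelated hashes,
# and therefore unrelated columns after the modulo step.
for seed in (0, 42):
    print(seed, mmh3.hash("A_C", seed))  # signed 32-bit hash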

Here is an example of code in R:

library(FeatureHashing)
#create a single-feature dataframe
data_tmp <- data.frame(x=c("A_C","B_D"))

#> data_tmp
#    x
#1 A_C
#2 B_D

#create feature hash.  R by default includes an intercept, so remove that
#with ~x -1
fhash <- hashed.model.matrix(~x -1, data=data_tmp, hash.size=16, create.mapping=TRUE)

as.matrix(fhash)
#     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#[1,] 0 0 0 0 1 0 0 0 0  0  0  0  0  0  0  0
#[2,] 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0

As you can see, R places "A_C" in the fifth column and "B_D" in the twelfth, and it does so consistently across runs. Now let's run the equivalent code in Python. Note that Python's FeatureHasher accepts input in several forms, e.g. a dict or a list of lists; I tried several and they all gave the same result.

from sklearn.feature_extraction import FeatureHasher
import pandas as pd

#create as a list of two single-element lists
data_tmp = [["A_C"],["B_D"]]

#can also do this, does the same thing
#pd.DataFrame(data_tmp)

#set up feature hash with same settings above
#set up feature hash with the same settings as above
feature_hash = FeatureHasher(alternate_sign=False, n_features=16, input_type="string")
fhash = feature_hash.transform(data_tmp)
fhash.todense()
#matrix([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Here, "A_C" and "B_D" are not only mapped to different indices than in R, but also both to the same column (index 2). That is a collision: the two 1 values are no longer distinguishable as separate features, which will degrade the classifier.
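If I've understood sklearn's internals correctly, FeatureHasher takes the signed 32-bit MurmurHash3 of the UTF-8 string with seed 0 and uses abs(h) % n_features as the column; that's my reading of the source, not documented behavior. Under that assumption the collision can be reproduced with sklearn's own hash helper:

from sklearn.utils import murmurhash3_32

n_features = 16
for s in ["A_C", "B_D"]:
    h = murmurhash3_32(s)          # signed 32-bit MurmurHash3, seed 0
    # if my reading is right, both strings land in the same column (2)
    print(s, abs(h) % n_features)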

Is there something obvious I am missing here? I saw, for instance, this post: Murmur3 hash different result between Python and Java implementation, but I don't know enough about it to apply it here. One thing I noticed is that in R, if you use the create.mapping option and then run

hash.mapping(fhash)
#xB_D xA_C
#  12    5

the printed mapping prefixes each string with an "x" (the variable name), so I thought this might be causing the problem. But then I re-ran the Python code above with

data_tmp = [["xA_C"],["xB_D"]]

but while I got different results than before, they still didn't match R's mapping. Maybe it's something internal in how Python stores the variable names? Thanks in advance, I'd really like to figure this out.
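In case it helps anyone reproduce my debugging, here is the comparison I ran (a sketch; the 1-based bucket on the last line is just my guess at how R's FeatureHashing might index, which I haven't confirmed):

from sklearn.utils import murmurhash3_32

# Compare Python-side buckets for the "x"-prefixed strings against
# R's reported mapping (xA_C -> 5, xB_D -> 12).
for s in ["xA_C", "xB_D"]:
    h = murmurhash3_32(s)    # signed 32-bit MurmurHash3, seed 0
    print(s, abs(h) % 16)    # sklearn-style 0-based column
    print(s, h % 16 + 1)     # hypothetical 1-based, R-style column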

Sam A.
