How exactly does feature hashing work?

Question

I have read many online articles on feature hashing of categorical variables for machine learning. Unfortunately, I still couldn't grasp the concept and understand how it works. I will illustrate my confusion through the sample dataset and hashing function below that I grabbed from another site:

>>>data

   pop     state  year
0  1.5      Ohio  2000
1  1.7      Ohio  2001
2  3.6  New York  2002
3  2.4    Nevada  2001
4  2.9    Nevada  2002
5  1.8    Oregon  2003

>>> def hash_col(df, col, N):
       cols = [col + "_" + str(i) for i in range(N)]
       def xform(x): tmp = [0 for i in range(N)]; tmp[hash(x) % N] = 1; return pd.Series(tmp,index=cols)
       df[cols] = df[col].apply(xform)
       return df.drop(col,axis=1)

The below functions are used to print out the different transformed output by specifying different number of dimensions (or in other words, hashed features):

>>> print(hash_col(data, 'state',4))

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        0        1        0
1  1.7  2001        0        0        1        0
2  3.6  2002        0        0        0        1
3  2.4  2001        0        1        0        0
4  2.9  2002        0        1        0        0
5  1.8  2003        0        0        0        1

>>> print(hash_col(data, 'state',5))
   pop  year  state_0  state_1  state_2  state_3  state_4
0  1.5  2000        1        0        0        0        0
1  1.7  2001        1        0        0        0        0
2  3.6  2002        1        0        0        0        0
3  2.4  2001        0        0        1        0        0
4  2.9  2002        0        0        1        0        0
5  1.8  2003        0        0        0        0        1

>>> print(hash_col(data, 'state',6))
   pop  year  state_0  state_1  state_2  state_3  state_4  state_5
0  1.5  2000        0        0        0        0        1        0
1  1.7  2001        0        0        0        0        1        0
2  3.6  2002        0        0        0        0        0        1
3  2.4  2001        0        0        0        1        0        0
4  2.9  2002        0        0        0        1        0        0
5  1.8  2003        0        0        0        0        0        1

What I can't understand is what does each of the 'state_0', 'state_1', 'state_2' etc. column represent. Also, since there are 4 unique states in my dataset (Ohio, New York, Nevada, Oregon), why are all the '1' allocated to just 3 'state_n' columns instead of 4 as in one hot encoding? For example, when I set the number of dimensions to 6, the output had two '1' in state_3, state_4 and state_5, but there was no '1' in state_0, state_1 and state_2. Any feedback would be greatly appreciated!

Bert Kellerman · Accepted Answer · 2021-06-04T00:28:43.377

4

Feature hashing is typically used when you don't know all the possible values of a categorical variable. Because of this, we can't create a static mapping from categorical values to columns. So a hash function is used to determine which column each categorical value corresponds to.

This is not the best use case because we know there are exactly 50 states and could just use one hot encoding.

A hash function will also have collisions, where different values are mapped to the same value. That is what's happening here. Two different state names hash to the same value after the modulus operation in the hash function.

One way to alleviate collisions is to make your feature space(number of columns) larger than the number of possible categorical values.

edited Jun 04 '21 at 00:28

answered Apr 16 '18 at 12:33

Bert Kellerman

1,590
10
17

Thanks @Bert. Another question I have is if feature space(number of columns) is equal to or larger than the number of possible categorical values, wouldn't that produce about the same number of columns as one hot encoding, since OHE has 1 feature column per categorical feature? If so, what would be the advantage of feature hashing over OHE? – Stanleyrr Apr 16 '18 at 15:43
1

You use it when you don't know all your categories ahead of time. For example, consider the variable 'brand of shirt'. You could create a one hot encoding from your training set, but then what happens at prediction time when you see a sample with a brand that wasn't in the training set? – Bert Kellerman Apr 17 '18 at 12:00

How exactly does feature hashing work?

1 Answers1