10

I would like to hash the feature 'Genre' into 6 columns and, separately, the feature 'Publisher' into another 6 columns. I want something like the table below:

         Genre Publisher    0    1    2    3    4    5    0    1    2    3    4    5
0     Platform  Nintendo  0.0  2.0  2.0 -1.0  1.0  0.0  0.0  2.0  2.0 -1.0  1.0  0.0
1       Racing      Noir -1.0  0.0  0.0  0.0  0.0 -1.0 -1.0  0.0  0.0  0.0  0.0 -1.0
2       Sports     Laura -2.0  2.0  0.0 -2.0  0.0  0.0 -2.0  2.0  0.0 -2.0  0.0  0.0
3  Roleplaying      John -2.0  2.0  2.0  0.0  1.0  0.0 -2.0  2.0  2.0  0.0  1.0  0.0
4       Puzzle      John  0.0  1.0  1.0 -2.0  1.0 -1.0  0.0  1.0  1.0 -2.0  1.0 -1.0
5     Platform      Noir  0.0  2.0  2.0 -1.0  1.0  0.0  0.0  2.0  2.0 -1.0  1.0  0.0

The following code does what I want:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

d = {'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
     'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']}
df = pd.DataFrame(data=d)

# One hasher per feature, so each column gets its own 6 hashed columns
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')
hashed_features1 = fh1.fit_transform(df['Genre']).toarray()
hashed_features2 = fh2.fit_transform(df['Publisher']).toarray()
pd.concat([df[['Genre', 'Publisher']],
           pd.DataFrame(hashed_features1),
           pd.DataFrame(hashed_features2)], axis=1)

This works for the above two features, but if I have, say, 40 categorical features, this approach becomes tedious. Is there any other way to do this?

– Noor

2 Answers

6

Hashing (Update)

Assuming that new categories might show up in some of the features, hashing is the way to go. Just two notes:

  • Be aware of the possibility of collisions and adjust the number of features accordingly (see the collision-check sketch after this list)
  • In your case, you want to hash each feature separately
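
As a quick way to gauge collision risk, here is a minimal sketch (not part of the original answer) that hashes a handful of categories at several values of n_features and counts how many distinct vectors survive. Each category is wrapped in a one-element list so the whole string, rather than its individual characters, is hashed:

from sklearn.feature_extraction import FeatureHasher

categories = ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle']
for n in (2, 6, 64):
    fh = FeatureHasher(n_features=n, input_type='string')
    # Wrap each category in a list so the whole string is hashed at once
    hashed = fh.transform([[c] for c in categories]).toarray()
    distinct = len({tuple(row) for row in hashed})
    print(f'n_features={n}: {distinct} distinct vectors for {len(categories)} categories')

Fewer distinct vectors than categories means collisions; a larger n_features makes them less likely.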

One Hot Vector

If the number of categories for each feature is fixed and not too large, use one-hot encoding.

I would recommend using either of the two:

  1. sklearn.preprocessing.OneHotEncoder
  2. pandas.get_dummies

Example

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                   'feature_2': ['cat', 'dog', 'elephant', 'zebra']})

# Approach 0 (Hashing per feature)
n_orig_features = df.shape[1]
hash_vector_size = 6
ct = ColumnTransformer([(f't_{i}',
                         FeatureHasher(n_features=hash_vector_size, input_type='string'),
                         i)
                        for i in range(n_orig_features)])
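# NOTE: each column arrives here as bare strings, so FeatureHasher hashes
# every character of each string rather than the string as a whole
# (see the last comment under this answer)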

res_0 = ct.fit_transform(df)  # res_0.shape[1] = n_orig_features * hash_vector_size

# Approach 1 (OHV)
res_1 = pd.get_dummies(df)

# Approach 2 (OHV)
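# NOTE: in scikit-learn >= 1.2 the `sparse` argument is renamed `sparse_output`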
res_2 = OneHotEncoder(sparse=False).fit_transform(df)

res_0 :

array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1., -1.,  0., -1.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  2., -1.,  0.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.,  0., -2.,  2.,  2., -1.,  0., -1.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  2.,  1., -1.,  0., -1.]])

res_1 :

   feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
0            1            0            0              1              0                   0                0
1            0            1            0              0              1                   0                0
2            0            0            1              0              0                   1                0
3            1            0            0              0              0                   0                1

res_2 :

array([[1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 1.]])
– Jan K
  • No, I can't use one-hot encoding, because I am taking the data in chunks, and a new chunk may contain categorical values in a feature (or features) that were not present in the first chunk. In that case one-hot encoding will produce more columns than the first chunk had. Since I need to feed the data to a classifier's partial_fit, I need the same number of columns in each iteration. Please see this link: https://stackoverflow.com/questions/54096164/how-to-sent-additional-unique-categorical-values-of-features-to-partial-fit-of-s?noredirect=1#comment95116600_54096164 – Noor Jan 19 '19 at 12:18
  • Your code is very helpful, thank you. But as you said, "Be aware of the possibility of collision and adjust the number of features accordingly", so with 40 feature columns, every column will probably need to be hashed to a different number of columns. For example, in your code I want feature_1 to be hashed to a vector of 6 (hash_vector_size=6) and feature_2 to a vector of 5 (hash_vector_size=5). How should I modify the code? PS: I tried to do it on my own but could not. – Noor Jan 19 '19 at 13:36
  • I would create a dictionary where the keys are the original feature names and the values are the hash_vector_size for each feature, then simply use this dictionary when instantiating the ColumnTransformer. See the `transformers` argument in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer – Jan K Jan 19 '19 at 13:51
  • `ct = ColumnTransformer([(f't_{i_key}', FeatureHasher(n_features=i_value, input_type='string'), i_key) for i_key, i_value in dict_hashed_vector_size.items()])` I have done as you recommended, and it seems to work. – Noor Jan 20 '19 at 13:03
  • There is something odd about Approach 0. As you can see in the results, it produces 4 non-zero indices for the first row instead of 2; I would expect 2, since there are 2 columns. The reason is that the feature hasher with input type 'string' expects a list of strings per sample; if you pass a bare string, EVERY character of it is hashed. So for the first row, ['A', 'cat'], you get 4 indices instead of 2. – filthysocks Mar 06 '19 at 12:55
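
Pulling the comment thread together, here is a minimal sketch that combines the two points above: a dictionary of per-feature hash sizes (reusing the `dict_hashed_vector_size` name from the comments, with illustrative sizes), plus one-element list wrapping so whole strings rather than characters are hashed. The wrapping step is an assumption about how the data is fed in, not something the original answer does:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                   'feature_2': ['cat', 'dog', 'elephant', 'zebra']})

# Illustrative per-feature hash sizes, as suggested in the comments
dict_hashed_vector_size = {'feature_1': 6, 'feature_2': 5}

# Wrap every cell in a one-element list so FeatureHasher hashes the whole
# string instead of its individual characters
df_wrapped = df.applymap(lambda v: [v])

ct = ColumnTransformer([(f't_{col}',
                         FeatureHasher(n_features=size, input_type='string'),
                         col)
                        for col, size in dict_hashed_vector_size.items()])

res = ct.fit_transform(df_wrapped)  # shape: (4, 6 + 5)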
2

Even though I am late here, from the examples I have seen on Kaggle, FeatureHashing is performed at once for multiple columns (i.e., on a DataFrame) rather than for individual columns whose sparse matrices are then concatenated. See the Notebooks on Kaggle, here and here. I have also used both ways of performing feature hashing on this data, i.e.:

a. Hash individual categorical columns and concatenate the results
b. Hash all categorical columns of a DataFrame at once

A Logistic Regression classifier gave significantly better results with approach (b) than with approach (a).
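
For reference, here is a minimal sketch of approach (b), assuming one common convention of turning each row into 'column=value' tokens before hashing (the linked notebooks may use a different scheme):

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'Genre': ['Platform', 'Racing', 'Sports'],
                   'Publisher': ['Nintendo', 'Noir', 'Laura']})

# Encode each row as a list of "column=value" tokens so that equal values
# in different columns land in different hash buckets
tokens = df.astype(str).apply(
    lambda row: [f'{col}={val}' for col, val in row.items()], axis=1)

fh = FeatureHasher(n_features=12, input_type='string')
hashed = fh.transform(tokens)  # one sparse matrix covering all columns
print(hashed.toarray())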

– Ashok K Harnal