Need a Work-around for OneHotEncoder Issue in SKLearn Preprocessing

Question

It seems that OneHotEncoder won't work with the np.int64 dtype (only np.int32). Here's a minimal example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

a = np.array([[56748683,8511896545,51001984320],[18643548615,28614357465,56748683],[8511896545,51001984320,40084357915]])
b = pd.DataFrame(a, dtype=np.int64)

ohe = OneHotEncoder()
c = ohe.fit_transform(b).toarray()

When I run this I get the following error: "ValueError: X needs to contain only non-negative integers."

As you can see, X DOES contain only non-negative integers! When I trim a few of the digits and change the datatype to int32 it works fine:

a = np.array([[56748,8511896,51001984],[18643548,28614357,56748],[8511896,51001984,40084357]])
b = pd.DataFrame(a, dtype=np.int32)
ohe = OneHotEncoder()
c = ohe.fit_transform(b).toarray()

Unfortunately, the data I need to encode has 11 digits (which can't be represented by int32). So, any suggestions would be helpful...

Also, I should mention, I don't necessarily need a one hot encoding, just need to create dummy variables. Thanks!
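To make the int32 limit concrete, here is a small sketch using the first row of the sample data. It shows which values exceed the signed 32-bit range and what a forced cast to int32 does to them. Whether scikit-learn performs such a cast internally is an assumption that the post does not confirm, but silent wraparound of this kind is one plausible way to end up with "negative" values from non-negative input.

```python
import numpy as np

# First row of the sample data from the question, stored as int64.
values = np.array([56748683, 8511896545, 51001984320], dtype=np.int64)

# Largest value a signed 32-bit integer can hold.
int32_max = np.iinfo(np.int32).max

# Flag the values that do not fit in int32.
too_big = values > int32_max

# A forced cast to int32 wraps out-of-range values silently
# instead of raising an error.
wrapped = values.astype(np.int32)

print(int32_max)
print(too_big)
print(wrapped)
```

The first value fits and survives the cast unchanged; the two 10- and 11-digit values do not.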

For simple encoding of the whole dataset at once, the answer by @σηγ works great. But when you need to split the data into train and test sets, it won't work if the test data contains values that the train data does not. - Vivek Kumar, Dec 01 '17 at 05:37
That's a really great point @Vivek Kumar - thanks for the reminder! - A.K. Ferrara, Dec 11 '17 at 04:52

score 1 · Accepted Answer · answered Nov 30 '17 at 19:13

Pandas has a get_dummies function that creates dummy variables:

import numpy as np
import pandas as pd

a = np.array([[56748683,8511896545,51001984320],[18643548615,28614357465,56748683],[8511896545,51001984320,40084357915]])
b = pd.DataFrame(a, dtype=np.int64)
b = b.astype('object')
c = pd.get_dummies(b)
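Regarding the train/test concern raised in the comments on the question: one possible way to keep the dummy columns consistent is to build them on the training data and then reindex the test dummies to the same columns. The sketch below is only an illustration under assumptions. The `train` and `test` frames, the column name `id`, and the unseen value 99999999999 are all made up for the example, and `dtype=int` requires a pandas version whose get_dummies accepts the `dtype` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical split: the test set contains one value (99999999999)
# that never appears in the training set.
train = pd.DataFrame({"id": np.array([56748683, 8511896545, 51001984320], dtype=np.int64)})
test = pd.DataFrame({"id": np.array([8511896545, 99999999999], dtype=np.int64)})

# Build the dummy columns from the training data only.
train_dummies = pd.get_dummies(train.astype("object"), dtype=int)

# Encode the test data, then align it to the training columns:
# columns missing from the test set are added and filled with 0,
# and columns for values unseen in training are dropped.
test_dummies = pd.get_dummies(test.astype("object"), dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies)
```

With this approach a test row whose value was not seen in training ends up as all zeros, which may or may not be the behavior you want.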
