
I am applying OneHotEncoder to a NumPy array.

Here's the code

print X.shape, test_data.shape #gives (4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()

print X.shape, test_data.shape #gives (4100, 46) (410, 43)

where both X and test_data are <type 'numpy.ndarray'>

X is my train set while test_data is my test set.

Why is the number of columns different for X and test_data? It should be the same for both (either 46 or 43) after applying the OneHotEncoder.

I am applying OneHotEncoder to those specific attributes because they are categorical in nature in both X and test_data.

Can someone point out what is wrong here?

prashantitis

1 Answer


Don't fit a new OneHotEncoder on test_data; reuse the first one and call only transform() on it. Do this:

test_data = onehotencoder_1.transform(test_data).toarray()

Never use fit() (or fit_transform()) on testing data.

The different number of columns is entirely possible: the test data may not contain some categories that are present in the train data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on it, it only learns the categories present in test_data, so the resulting columns differ from those learned on the train data.

Vivek Kumar
  • I think you mean `test_data = onehotencoder_1.transform(test_data).toarray()`, right? You accidentally wrote `test_data = onehotencoder_1.fit_transform(test_data).toarray()`. – prashantitis May 22 '18 at 06:32
  • @Guru Yes. Sorry, that was a copy paste typo. Corrected now. – Vivek Kumar May 22 '18 at 06:41
  • Thanks @Vivek. Can you also throw light on how to get rid of the dummy variable trap when using OneHotEncoder? I have read about it in many places, and people suggest dropping one column. How do I drop a column here, and which column should I drop? – prashantitis May 22 '18 at 07:18
  • @Guru That is not that simple to do in OneHotEncoder. You have to write a custom class for that. For each category defined in `categorical_features`, there will be a column to remove. You can use `pandas.get_dummies()` but that won't work well with a train and test split. – Vivek Kumar May 22 '18 at 07:22
  • Thanks for your inputs. As mentioned by you, `For each category defined in categorical_features, there will be a column to remove` — which column? How do I decide that? I can write a custom class. Any resource explaining the same would be really helpful. Thanks – prashantitis May 22 '18 at 08:29
  • @Guru Most libraries remove the first column (level) in the transformed data. You can check the source code of [OneHotEncoder on the github](https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/data.py#L1840) for starting – Vivek Kumar May 22 '18 at 08:47