
Considering data like:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)  

I want to exclude the text column using the OHE functionality.

Why does the following not work?

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))       
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'

It says in the documentation:

categorical_features: “all” or array of indices or mask :
  Specify what features are treated as categorical.
   ‘all’ (default): All features are treated as categorical.
   array of indices: Array of categorical feature indices.
   mask: Array of length n_features and with dtype=bool.

I'm using a mask, yet it still tries to convert to float.

Even using

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool), 
                    dtype=dt)        
ohe.fit(d)

Same error.

And also in the case of "array of indices":

ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)        
ohe.fit(d)
PascalVKooten

3 Answers


You should understand that all estimators in Scikit-Learn are designed for numerical inputs only, so there is no point in keeping the text column in this form. You have to transform that text column into something numerical, or get rid of it.
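A minimal sketch of the first option, using `LabelEncoder` to map the text column to integer codes before any one-hot encoding (the data layout mirrors the question's toy rows; the variable names are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy data: one text column, two numeric columns
rows = [('aaa', 1, 1), ('bbb', 2, 2)]
text = [r[0] for r in rows]

# LabelEncoder maps each distinct string to an integer code
le = LabelEncoder()
codes = le.fit_transform(text)  # 'aaa' -> 0, 'bbb' -> 1

# Rebuild a purely numerical matrix that OneHotEncoder can accept
X = np.column_stack([codes,
                     [r[1] for r in rows],
                     [r[2] for r in rows]])
print(X)
# [[0 1 1]
#  [1 2 2]]
```

Note that `LabelEncoder` assigns codes by sorted order of the distinct strings, so the mapping is reproducible on data containing the same labels.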

If you obtained your dataset from a Pandas DataFrame, you can take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It will help you transform all the needed columns simultaneously (or leave some of the columns in numerical form):

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb

# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
Ibraim Ganiev
    If you're using a pandas dataframe, setting the type using [`df[column].astype('category')`](https://pandas-docs.github.io/pandas-docs-travis/categorical.html) and using the [`get_dummies`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) method will one-hot encode text columns for you as well. – hume Dec 10 '15 at 16:49
  • @hume, yes, but if you get additional dataframe of the same structure - you should be able to encode it with the same encoding you used in train set. – Ibraim Ganiev Dec 10 '15 at 19:58
  • yep, agreed! If you need consistent transforms across datasets where those text values may differ, you're better off using an sklearn encoder that has `fit` and `transform` methods. – hume Dec 10 '15 at 20:23
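The pandas route mentioned in the comments can be sketched like this (same toy frame as the answer above):

```python
import pandas as pd

data = pd.DataFrame({'text': ['aaa', 'bbb'],
                     'number_1': [1, 1],
                     'number_2': [2, 2]})

# get_dummies one-hot encodes the listed columns and leaves the
# numeric columns untouched
encoded = pd.get_dummies(data, columns=['text'])
print(encoded)
```

As hume's caveat points out, `get_dummies` derives its output columns from the values it sees, so two frames with different text values produce differently shaped outputs; a fitted sklearn encoder avoids that.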

I think there's some confusion here. You still need to pass in numerical values, but within the encoder you can specify which features are categorical and which are not.

The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.

So in the example below I change aaa to 5 and bbb to 6. That way they are distinguished from the numerical values 1 and 2:

d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)

Now you can check your feature categories:

ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
Leb
  • Note that I do NOT want to use the letters. My bad for not indicating, but I'd like to ignore the text column (since the OHE only accepts integers, and I just want to drop this text). – PascalVKooten Dec 04 '15 at 15:18
  • AFAIK you'll need to drop them before feeding them into the encoder. Can you not do that? Where `d = d[:,1:]` – Leb Dec 04 '15 at 15:24
  • Not really if you want to make it automated. Again, have a look at the argument to OHE called `categorical_features`: it seems that this exactly describes what I would like it to do. The documentation further states: "Non-categorical features are always stacked to the right of the matrix." – PascalVKooten Dec 04 '15 at 15:34
  • @PascalvKooten, Leb is right, every feature you pass into OHE has to be numerical, whether you selected it as categorical or not. That's an implementation limitation. So it's easier to just drop the non-selected features before your call to OHE. – Ibraim Ganiev Dec 05 '15 at 07:31
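Dropping the text field automatically, as suggested in the comments, can be sketched for the question's structured array (the field names `f0`, `f1`, `f2` are NumPy's defaults for the dtype string `'object, i4, i4'`):

```python
import numpy as np

dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)

# Keep only the fields whose dtype is numeric
numeric_fields = [name for name in d.dtype.names
                  if np.issubdtype(d.dtype[name], np.number)]

# Convert the selected fields into a plain 2-D integer array
X = np.array(d[numeric_fields].tolist())
print(X)
# [[1 1]
#  [2 2]]
```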

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).

This check fails if any of the data in the provided array X cannot be successfully converted to a float.
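The effect of that check can be reproduced directly with the public `sklearn.utils.check_array` helper (a small demonstration of the failure mode, not the encoder's exact internal call):

```python
import numpy as np
from sklearn.utils import check_array

# Same shape as the question's data: one string column, two ints
X = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=object)

# check_array tries to coerce everything to float and raises on the
# strings before any column selection could ever happen
try:
    check_array(X, dtype=np.float64)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)  # a "could not convert string to float" error

print(failed)
```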

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.

Bahman Engheta