I have a Pandas DataFrame with two categorical variables, an ID variable, and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder, which results in a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers as OneHotEncoder needs integers as input
... remapping code ...

ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])

But I have no idea how to use this sparse matrix in a DecisionTreeClassifier, especially since I want to add some other non-categorical variables from my dataframe later on. Thanks!

EDIT: In reply to miraculixx's comment, I also tried the DataFrameMapper from sklearn-pandas:

mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])

t = mapper.fit_transform(df)

But then I get this error:

TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).

shivsn
Bert Carremans
  • [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) is really helpful when working with dataframes and sklearn. – miraculixx Jul 21 '16 at 21:31

2 Answers

I see you are already using Pandas, so why not use its get_dummies function?

import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])

result

   name age-group
0  rick     young
1  phil       old
2  john  teenager

Now encode it with get_dummies:

pd.get_dummies(df)

result

   name_john  name_phil  name_rick  age-group_old  age-group_teenager  age-group_young
0          0          0          1              0                   0                1
1          0          1          0              1                   0                0
2          1          0          0              0                   1                0

You can then pass the new Pandas DataFrame directly to sklearn's DecisionTreeClassifier.
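As a minimal sketch of that last step: the toy frame below is made up for illustration, and the column names (id_col, col_a, col_b, num_col, target_col) are assumptions rather than the asker's real data. The columns argument restricts get_dummies to the categorical columns, so a numeric column passes through untouched, and dtype=int keeps the indicators as 0/1 integers.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    'id_col': [1, 2, 3, 4],
    'col_a': ['x', 'y', 'x', 'z'],
    'col_b': ['p', 'q', 'q', 'p'],
    'num_col': [0.5, 1.5, 2.5, 3.5],
    'target_col': [0, 1, 1, 0],
})

# Keep the ID and the target out of the feature matrix.
features = df.drop(columns=['id_col', 'target_col'])

# One-hot encode only the categorical columns; num_col is left as-is.
X = pd.get_dummies(features, columns=['col_a', 'col_b'], dtype=int)
y = df['target_col']

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
```

Dropping the ID column before encoding avoids creating one dummy column per distinct ID, which can make the frame very large.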

Guiem Bosch
  • Thanks Guiem Bosch, that worked. However, I had to apply get_dummies to only the two categorical columns. When I left the ID variable in the DataFrame, my kernel died. The following code worked: pd.get_dummies(df[['col_a', 'col_b']]) – Bert Carremans Jul 22 '16 at 07:48
  • Additionally, remapping the string values to integers is not necessary. If you do remap them, get_dummies doesn't seem to do anything. – Bert Carremans Jul 22 '16 at 14:50

Look at this example from scikit-learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py

The problem is that you are not passing the sparse matrix to xx.fit(); you are passing the original data.
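A hedged sketch of what passing the sparse matrix could look like, on made-up data with hypothetical column names (col_a, col_b, num_col, target_col). It assumes a scikit-learn version whose OneHotEncoder accepts string categories directly (0.20 or later), and it uses scipy.sparse.hstack to append a non-categorical column to the encoded matrix.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    'col_a': ['x', 'y', 'x', 'z'],
    'col_b': ['p', 'q', 'q', 'p'],
    'num_col': [0.5, 1.5, 2.5, 3.5],
    'target_col': [0, 1, 1, 0],
})

# Encode the categorical columns; the result is a scipy sparse matrix.
ohe = OneHotEncoder()
X_cat = ohe.fit_transform(df[['col_a', 'col_b']])

# Convert the numeric column to sparse form and stack it next to the encoded columns.
X_num = sparse.csr_matrix(df[['num_col']].to_numpy())
X = sparse.hstack([X_cat, X_num], format='csr')

# Fit on the sparse matrix itself, not on the original dataframe.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, df['target_col'])
predictions = clf.predict(X)
```

DecisionTreeClassifier accepts sparse input for both fit and predict, so there is no need to densify the matrix first.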

Merlin