
I have been trying to use categorical inputs in a regression tree (or RandomForestRegressor), but scikit-learn keeps raising errors and asking for numerical inputs.

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

MODEL = RandomForestRegressor(n_estimators=100)
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4]) # does not work: cannot convert 'a' to float
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4]) # works

MODEL = DecisionTreeRegressor()
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4]) # does not work: cannot convert 'a' to float
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4]) # works

To my understanding, categorical inputs should be possible in these methods without any conversion (e.g. weight-of-evidence substitution).

Has anyone else had this difficulty?

thanks!

jpsfer

2 Answers


scikit-learn has no dedicated representation for categorical variables (a.k.a. factors in R). One possible solution is to encode the strings as integers using LabelEncoder:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# transform the 1st column to integer labels
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])

# np.asarray produced a string array, so cast everything to float
X = X.astype(float)

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))

Output:

[[ 0.  1.  2.]
 [ 1.  2.  3.]
 [ 0.  3.  2.]
 [ 2.  1.  3.]]
[ 1.61333333  2.13666667  2.53333333  2.95333333]

But remember that this is a slight hack if a and b are independent categories, and it is only even remotely tolerable for tree-based estimators. Why? Because b is not really bigger than a: the integer labels impose an artificial ordering that the model may exploit. The correct way is to one-hot encode the column, either with OneHotEncoder after the LabelEncoder or with pd.get_dummies, yielding one separate 0/1 column per category in place of X[:, 0].

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# one-hot encode the 1st column: one 0/1 column per category
X_0 = pd.get_dummies(X[:, 0]).values.astype(float)
X = np.column_stack([X_0, X[:, 1:].astype(float)])

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
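
For completeness, here is a minimal sketch of the OneHotEncoder route mentioned above. It assumes a recent scikit-learn (0.20 or later), where OneHotEncoder accepts string columns directly, so the LabelEncoder step is no longer needed:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

X_raw = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)], dtype=object)
y = np.asarray([1,2.5,3,4])

# fit_transform expects 2-D input, hence the [:, [0]] slice;
# .toarray() densifies the sparse result so it can be stacked
encoder = OneHotEncoder()
X_cat = encoder.fit_transform(X_raw[:, [0]]).toarray()
X = np.column_stack([X_cat, X_raw[:, 1:].astype(float)])

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(regressor.predict(X))

The fitted encoder keeps the category-to-column mapping in encoder.categories_, so new data can be transformed consistently with encoder.transform.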
Matt
  • Thanks for that. I don't think it solves the problem, though: the numerical labels create an assumption of an ordering that will most likely be untrue to what you are trying to predict. Imagine a decision tree node deciding its next split: a cut-off like 'x < 2 vs. x >= 2' does not mean the same thing as "if x in ('a','c')". – jpsfer Nov 21 '13 at 12:33
  • I misread your question. I just saw that you want to treat everything as categorical. I will update the example accordingly... – Matt Nov 21 '13 at 16:09
  • This helps, and I had tried this (with less elegant code, I should say), but the problem is that one-hot encoding makes the information contained in the variable less likely to be chosen in a regression tree. I guess this happens because the predictive power is now split across multiple variables. Nevertheless, your code was very helpful in showing how to do this much more efficiently. Many thanks. – jpsfer Nov 22 '13 at 00:18

You must dummy-code by hand in Python. I would suggest pandas.get_dummies() for one-hot encoding. For boosted trees, I have had success using factorize() to achieve ordinal encoding.
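
A minimal sketch of both encodings (the toy DataFrame and the column name 'color' are made up for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['a', 'b', 'a', 'c'],
                   'x1':    [1, 2, 3, 1]})

# one-hot encoding: one 0/1 column per category
dummies = pd.get_dummies(df, columns=['color'])
print(dummies)

# ordinal encoding: each category is mapped to an integer code
codes, uniques = pd.factorize(df['color'])
df['color_code'] = codes
print(df)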

There is also a whole package for this sort of thing here.

For a more detailed explanation, see this Data Science Stack Exchange post.

Keith