
I'm generating feature vectors for scikit-learn's random forest classifier. Each feature vector represents the names of 9 protein amino acid residues. There are 20 possible residue names, so I use 20 dummy variables to represent one residue name; for 9 residues, I have 180 dummy variables.

For example, if the 9 residues in the sliding window are ARNDCQEGH (each letter represents the name of a protein residue), my feature vector will be:

"True\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\t
False\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tTrue\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\tFalse\n" 
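A minimal sketch of how such a window could be expanded into the 9 * 20 booleans above (the ordering of the 20 residue names below is only an assumption for illustration):

# Minimal sketch: one-hot encode a 9-residue window into 9 * 20 booleans.
# The ordering of the 20 residue names is an assumption for illustration.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def encode_window(window):
    """Return a flat list of 9 * 20 booleans for a 9-residue window."""
    features = []
    for residue in window:
        features.extend(aa == residue for aa in AMINO_ACIDS)
    return features

row = encode_window("ARNDCQEGH")
print("\t".join(str(flag) for flag in row))  # reproduces the True/False pattern shown above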

I also tried using (1, 0) in place of (True, False).

After training and testing scikit-learn's random forest classifier, I found that it did not work at all on this data, even though the same classifier works with my other numerical data.

Can scikit-learn's random forest deal with categorical variables or dummy variables? If so, could you provide an example showing how it works?

Here is how I set up the random forest:

clf = RandomForestClassifier(n_estimators=800, criterion='gini', n_jobs=12, max_depth=None, compute_importances=True, max_features='auto', min_samples_split=1, random_state=None)

Thanks a lot in advance!

– Lucy

2 Answers


Using boolean features encoded as 0 and 1 should work. If the predictive accuracy is bad even with a large number of decision trees in your forest, it might be that your data is too noisy for the learning algorithm to pick up anything interesting.

Have you tried to fit a linear model (e.g. Logistic Regression) as a baseline on this data?
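For example, a quick cross-validated baseline could look something like this (the data below is a random placeholder with the question's 180-column shape; it only illustrates the evaluation, not the real dataset):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

# Random placeholder standing in for the 9 x 20 = 180 dummy variables and binary labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 180)).astype(float)
y = rng.randint(0, 2, size=1000)

baseline = LogisticRegression(C=1.0)
scores = cross_val_score(baseline, X, y, cv=5, scoring='f1')
print("mean F1: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))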

Edit: in practice, using integer coding for categorical variables tends to work very well for many randomized decision tree models (such as RandomForest and ExtraTrees in scikit-learn).

– ogrisel
  • How can I use boolean features encoded as 0 and 1? I mean, if I use 0 and 1, they will be treated as integers. – Lucy Apr 05 '13 at 21:18
  • And you mentioned that '2 consecutive splits are required to isolate samples with feature 1 instead of just one if you use the dummy variables (aka boolean one-hot) encoding', so how can I set 2 consecutive splits in scikit-learn's random forest classifier? – Lucy Apr 05 '13 at 21:55
  • Actually, I used Logistic Regression on similar feature vectors and it worked very well. This time I want to try random forest, since people say random forests are good with imbalanced data sets. – Lucy Apr 05 '13 at 22:00
  • I just tried using integer values (0, 1) in place of boolean values (False, True). The result is almost the same as with booleans. – Lucy Apr 05 '13 at 22:05
  • Using (0, 1) or (True, False) is the same, as internally scikit-learn will convert everything to feature vectors of floats with values 0.0 and 1.0. It is weird that logistic regression can significantly outperform the RF model, though. What do you mean by "not work" and "work"? Do you measure the f1-score? What are the values? Did you cross-validate? Have you tried LogisticRegression on the exact same data? (Maybe the labels were corrupted during the data preprocessing?) – ogrisel Apr 08 '13 at 08:44
  • Also, could you publish your data (if less than a couple of MB) along with a minimalistic (e.g. 20 lines max) reproduction Python script that exhibits the problem, on http://gist.github.com for instance? – ogrisel Apr 08 '13 at 08:46
  • When the f1-score reaches or exceeds what we expect, I'll say it works :). So far I can get an f1-score of 32, whereas people report they can get an f1-score of 40, but I'm using a different dataset from theirs. I haven't tried LogisticRegression on the exact same data. How could the labels be corrupted during the data preprocessing? My data set is 190 MB. – Lucy Apr 09 '13 at 16:44
  • You mean the f1-score is 0.32, right? How many target classes do you have, to get an idea of the chance level? If you cannot find a way to provide us with a minimalistic reproduction case, I don't see how people can help you. – ogrisel Apr 10 '13 at 09:16
  • I have 2 target classes, which I represent with 'True' and 'False'. The data set is too big. Here is my code: – Lucy Apr 12 '13 at 23:02
  • If the predicted probability for the positive class is greater than 0.3, the target is predicted as positive. – Lucy Apr 12 '13 at 23:08
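A minimal sketch of the evaluation discussed in the comments above: cross-validated f1-score for the forest, plus the 0.3 probability threshold (the data is again a random placeholder, not the real dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split  # sklearn.cross_validation in older releases
from sklearn.metrics import f1_score

# Random placeholder with the same 180-column shape as the question's features.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 180)).astype(float)
y = rng.randint(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated f1-score, as suggested in the comments.
print(cross_val_score(clf, X, y, cv=5, scoring='f1').mean())

# Thresholding the positive-class probability at 0.3 instead of the default 0.5.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba > 0.3).astype(int)
print(f1_score(y_test, y_pred))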

Scikit-learn's random forest classifier can work with dummified variables, but it can also use categorical variables directly, which is the preferred approach. Just map your strings to integers. Assume your feature vector is ['a', 'b', 'b', 'c']:

vals = ['a','b','b','c']
#create a map from your variable names to unique integers:
intmap = dict([(val, i) for i, val in enumerate(set(vals))]) 
#make the new array hold corresponding integers instead of strings:
new_vals = [intmap[val] for val in vals]

new_vals now holds the corresponding integers (e.g. [0, 2, 2, 1], depending on how the set happens to be ordered), and you can give it to the RF directly, without doing the dummification.
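For instance, applied to the question's 9-residue windows, one could integer-code each window position and fit the forest on a 9-column matrix. This is only a sketch; the windows and labels below are made up for illustration:

from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 residue names; the ordering is arbitrary
intmap = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Made-up 9-residue windows and binary labels, just for illustration.
windows = ["ARNDCQEGH", "HGEQCDNRA", "AAAAAAAAA", "VVVVVVVVV"]
labels = [1, 0, 1, 0]

# One integer-coded feature per window position: shape (n_windows, 9).
X = [[intmap[res] for res in w] for w in windows]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
print(clf.predict([[intmap[res] for res in "ARNDCQEGH"]]))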

– Ando Saabas
  • Still, using integer encoding for categorical variables might mislead the scikit-learn decision tree algorithm, as integers are treated as numerical values with a total ordering: a split on 0.5 will put samples with feature value 0 in the left branch and values 1 and 2 in the right branch, even if the algorithm was really only interested in distinguishing value 1 from the others. Two consecutive splits are required to isolate samples with feature value 1, instead of just one split if you use the dummy-variable (aka boolean one-hot) encoding. – ogrisel Apr 05 '13 at 10:18
  • I just tried using categorical variables directly by mapping the 20 possible names to 20 integers. I got a 0.5% better F-measure (my dataset is imbalanced), so it seems there is no significant difference. – Lucy Apr 05 '13 at 21:13
  • @ogrisel I totally agree with you. Splitting a categorical feature vector of n categories into n or n-1 vectors is ideal. But what would you suggest when n, the number of possible categories, is big, e.g. 1000 different categories in a vector of 20000 entries? Would you still split a single feature vector into 1000 boolean feature vectors, each 20000 entries long? That would kill performance in scikit-learn as far as I know. What would you do in this case? – Cobry Dec 01 '15 at 02:59
  • 3
    My earlier comment from 2013 is misleading. Based on experience I have changed opinion on this matter: in practice, if the cardinality of the categorical variables is very large, using an arbitrary integer encoding for categorical features works well with scikit-learn forests and boosted trees. Note that 1-hot-encoding can be represented efficiently in memory with a scip sparse matrix datastructure. Recent version of scikit-learn support it as input for decision trees but integer coding just works better in many cases. – ogrisel Dec 01 '15 at 12:42
  • 2
    Also note that sparse representation for 1-hot-encoding can be efficiently processed by linear models (e.g. LogisticRegression). – ogrisel Dec 01 '15 at 12:43
  • @ogrisel: are you advising that only for ordinal categoricals, or nominals too? Does it make sense on nominals? – smci Nov 16 '16 at 13:54
  • My initial comment of Apr 5 '13 might be ill-informed: most randomized decision tree-based models will tend to work better (and more efficiently) with integer-encoded categorical variables (instead of one-hot encoded categorical variables). This is especially true if some categorical variables have a very large cardinality (e.g. as large as the number of remaining features). – ogrisel Nov 17 '16 at 09:01
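A minimal sketch of the sparse one-hot input mentioned in ogrisel's comments above, assuming a reasonably recent scikit-learn (the integer-coded matrix is a random placeholder):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Random placeholder: 1000 windows, 9 positions, each coded as an integer in [0, 20).
rng = np.random.RandomState(0)
X_int = rng.randint(0, 20, size=(1000, 9))
y = rng.randint(0, 2, size=1000)

# One-hot encode into a scipy.sparse matrix (sparse output is the default).
encoder = OneHotEncoder()
X_sparse = encoder.fit_transform(X_int)  # roughly 9 * 20 one-hot columns

# Recent scikit-learn forests accept sparse input directly.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_sparse, y)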