
While working with DecisionTreeClassifier I visualized it using graphviz and, to my astonishment, it seems to take categorical data and use it as continuous data.

All my features are categorical. For example, the first feature, X[0], has 6 possible values (0, 1, 2, 3, 4, 5), yet the exported tree splits on it with a numeric threshold (tree visualization image omitted). From what I found here, the class uses a tree class which is a binary tree, so it is a limitation in sklearn.

Does anyone know a way I am missing to make the tree treat the features categorically? (I know it is not better for the task, but since I need categories I am currently using one-hot vectors on the data.)

A sample of the original data looks like this:

f1 f2 f3  f4  f5  f6  f7  f8  f9  f10  c1  c2  c3
0  C  S  O   1   2   1   1   2   1    2   0   0   0
1  D  S  O   1   3   1   1   2   1    2   0   0   0
2  C  S  O   1   3   1   1   2   1    1   0   0   0
3  D  S  O   1   3   1   1   2   1    2   0   0   0
4  D  A  O   1   3   1   1   2   1    2   0   0   0
5  D  A  O   1   2   1   1   2   1    2   0   0   0
6  D  A  O   1   2   1   1   2   1    1   0   0   0
7  D  A  O   1   2   1   1   2   1    2   0   0   0
8  D  K  O   1   3   1   1   2   1    2   0   0   0
9  C  R  O   1   3   1   1   2   1    1   0   0   0

where X[0] = f1, and I encoded the strings as integers since sklearn does not accept strings.
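For reference, the encoding described above can be sketched like this (column names and values are taken from the data sample; the labels are adapted for illustration so the fitted tree actually contains a split):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

df = pd.DataFrame({
    'f1': ['C', 'D', 'C', 'D', 'D'],
    'f2': ['S', 'S', 'S', 'S', 'A'],
    'c1': [0, 0, 0, 0, 1],
})

# Integer-encode the string columns (sklearn trees do not accept strings)
for col in ['f1', 'f2']:
    df[col] = df[col].astype('category').cat.codes

clf = DecisionTreeClassifier().fit(df[['f1', 'f2']], df['c1'])
# The dot text shows numeric threshold splits on the encoded integer codes
dot = export_graphviz(clf, out_file=None)
```

The exported dot text is what graphviz renders, and the splits there are numeric comparisons on the integer codes, which is exactly the behavior observed in the question.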

asked by Eytan · edited by desertnaut
  • How can "the first feature, X[0], has 6 possible values 0, 1, 2, 3, 4, 5" while all your features are categorical? Please post a sample of your data; otherwise your question is completely unclear... – desertnaut Dec 18 '17 at 20:16
  • If you haven't already, try to convert all the elements in your categorical features to strings and see if that fixes the problem. – Alex Dec 18 '17 at 20:25
  • @desertnaut it has 6 categories encoded as 0, 1, 2, 3, 4, 5. I'll add a sample, though. – Eytan Dec 18 '17 at 22:27
  • @Alex sklearn's DecisionTree doesn't accept strings as input, as far as I know... – Eytan Dec 18 '17 at 22:28

1 Answer


Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899) from June 2015, but it is still open (UPDATE: it is now closed, but continued in #12866, so the issue is still not resolved).

The problem with coding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case. For example, you could encode ['low', 'medium', 'high'] as [0, 1, 2], since 'low' < 'medium' < 'high' (we call such categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between 'low' and 'medium' is the same as the distance between 'medium' and 'high' (of no impact in decision trees, but of importance e.g. in k-nn and clustering). But this approach fails completely in cases like, say, ['red', 'green', 'blue'] or ['male', 'female'], since we cannot claim any meaningful relative order between the values.
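The imposed order can be made concrete with a tiny sketch: a binary tree splits on a single numeric threshold, so with arbitrary integer codes it can only separate categories into "codes below the threshold" and "codes above it", never an arbitrary subset:

```python
# Integer codes force threshold splits that group categories by their
# (arbitrary) numeric order, not by any real relationship between them.
colors = ['red', 'green', 'blue']
codes = {c: i for i, c in enumerate(colors)}   # red=0, green=1, blue=2

# A split like "color_code <= 1.5" can only separate {red, green} from
# {blue}; no single split can isolate {green} from {red, blue}.
left = [c for c in colors if codes[c] <= 1.5]
right = [c for c in colors if codes[c] > 1.5]
```

Which subsets are reachable thus depends entirely on the (meaningless) order in which the categories happened to be encoded.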

So, for non-ordinal categorical variables, the way to properly encode them for use in sklearn's decision tree is the OneHotEncoder module. The Encoding categorical features section of the User Guide might also be helpful.
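A minimal sketch of the OneHotEncoder route, on toy data in the shape of the question's sample (the labels here are invented for illustration; in recent sklearn versions the encoder output is a sparse matrix by default, which the tree accepts directly):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X = np.array([['C', 'S'], ['D', 'S'], ['C', 'A'], ['D', 'A']])
y = np.array([0, 0, 1, 1])  # in this toy set, the label depends only on f2

# One binary column per category; unknown categories at predict time
# are encoded as all-zeros rather than raising an error
enc = OneHotEncoder(handle_unknown='ignore')
X_hot = enc.fit_transform(X)

clf = DecisionTreeClassifier(random_state=0).fit(X_hot, y)
pred = clf.predict(enc.transform([['D', 'A']]))
```

Each split in the resulting tree is now "is category k present or not", which matches the textbook notion of a categorical decision tree split, at the cost of a wider feature matrix.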

answered by desertnaut
  • Some problems might arise from using one-hot encoding: not only the dimensionality, but also the tree depth, might end up greater than the number of original features. I also wanted to use the tree for teaching; data in one-hot vector format gives semantic equality with the theoretical decision tree, which is good, but it is limiting when teaching (especially since we can plot the trees). – Eytan Dec 19 '17 at 11:12
  • @Eytan You are absolutely correct, and actually this is one reason for the (still open) issue at GitHub. You might want to consider other tools that make the procedure more transparent for the end user, like R... – desertnaut Dec 19 '17 at 11:16
  • There is a catch here. If you are using the sklearn package installed from pip, it handles categorical values with dtype=category (in the current version), but if you install from the Conda distribution, it throws an error when the fit method is called! – Gana Jan 12 '19 at 06:42
  • Well, if there are 50 categories present in a categorical variable, it wouldn't be wise to one-hot-encode them, would it? How do you handle that case then? – Kristada673 Nov 08 '19 at 05:09
  • @Kristada673 There are other ways to encode categorical variables, like binary encoding. Check out the Python package of categorical encoders: https://github.com/scikit-learn-contrib/categorical-encoding – Arun Jan 29 '20 at 23:14
  • So if you are using sklearn, you must convert categorical columns using OrdinalEncoder or OneHotEncoder, right? And if both of them become a problem, you need to use a different library that handles categorical data as-is (such as the one mentioned above)? – haneulkim Aug 31 '21 at 02:04