Python RandomForest - Unknown label Error

Question

I am having trouble using the RandomForest fit function.

This is my training set:

         P1      Tp1           IrrPOA     Gz          Drz2
0        0.0     7.7           0.0       -1.4        -0.3
1        0.0     7.7           0.0       -1.4        -0.3
2        ...     ...           ...        ...         ...
3        49.4    7.5           0.0       -1.4        -0.3
4        47.4    7.5           0.0       -1.4        -0.3
... (10k rows)

I want to predict P1 from all the other variables using RandomForest from sklearn.ensemble:

colsRes = ['P1']
X_train = train.drop(colsRes, axis = 1)
Y_train = pd.DataFrame(train[colsRes])
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, Y_train)

Here is the error I get:

ValueError: Unknown label type: array([[  0. ],
       [  0. ],
       [  0. ],
       ..., 
       [ 49.4],
       [ 47.4],

I did not find anything about this label error. I am using Python 3.5. Any advice would be a great help!

Which version of sklearn are you using? – Gurupad Hegde Dec 13 '15 at 00:41

score 22 · Accepted Answer · edited May 23 '17 at 12:18

When you pass label (y) data to rf.fit(X, y), it expects y to be a 1D array. Selecting columns with a list of names, as in train[colsRes], returns a 2D DataFrame, and wrapping it in pd.DataFrame keeps it 2D, which conflicts with that expectation. You need to convert the 2D data from the pandas DataFrame to a 1D list or array, as the fit function expects.

Try a 1D list first:

Y_train = list(train.P1.values)
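
To make the shape difference concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative, not the asker's data). Selecting with a list of column names keeps two dimensions, while selecting a single column name gives one:

```python
import pandas as pd

# Hypothetical miniature of the training set
train = pd.DataFrame({"P1": [0.0, 49.4, 47.4], "Tp1": [7.7, 7.5, 7.5]})

y_2d = train[["P1"]]                       # list of columns -> DataFrame, 2D
y_1d = train["P1"]                         # single column   -> Series, 1D
y_flat = train[["P1"]].to_numpy().ravel()  # flatten a 2D selection to 1D

print(y_2d.shape, y_1d.shape, y_flat.shape)
```

Either y_1d or y_flat has the 1D shape described above. Note that the shape is only one part of the check; the label values themselves also matter for a classifier.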

If this does not solve the problem, try the solution mentioned in MultinomialNB error: "Unknown Label Type", which converts the labels to byte strings:

Y_train = np.asarray(train['P1'], dtype="|S6")

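So your code becomes:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

colsRes = ['P1']
X_train = train.drop(colsRes, axis = 1)
# Convert the float labels to byte strings so they are treated as class labels
Y_train = np.asarray(train['P1'], dtype="|S6")
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, Y_train)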
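
As a rough illustration of what the "|S6" dtype does, the sketch below converts a few made-up float labels. Each float becomes a fixed-width byte string, so every distinct value ends up as its own class. With thousands of continuous values that can mean a very large number of classes, which may explain a heavy fit. As far as I understand, text longer than six bytes would be cut off, so treat this as a workaround rather than a general recipe:

```python
import numpy as np

p1 = np.array([0.0, 0.0, 49.4, 47.4])   # made-up float labels

# Fixed-width byte strings, at most 6 bytes per label
labels = np.asarray(p1, dtype="|S6")

print(labels.dtype)
print(np.unique(labels))   # each distinct string is a separate class
```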

answered Dec 13 '15 at 00:05

Gurupad Hegde


I understand the problem now, but your code does not do the work. I am looking for other ways to do the conversion – Dragonfly Dec 13 '15 at 00:18
Can you try with `Y_train = list(train.P1.values)`? Let me know the error message if there is any – Gurupad Hegde Dec 13 '15 at 00:20
The error message does not change. The shape of Y_train looks good though: print(Y_train) gives [ 0. 0. 0. ..., 49.4 47.4 45.4] and its shape is (34208,), where 34208 is the number of rows. – Dragonfly Dec 13 '15 at 00:28

Thanks for the error message. Can you try this: `Y_train = np.asarray(train['P1'], dtype="|S6")` – Gurupad Hegde Dec 13 '15 at 00:31
It worked, thanks a lot! My PC crashed three times because it could not handle the computation, though... – Dragonfly Dec 13 '15 at 01:51

score 9 · Answer 2 · edited May 23 '17 at 12:09

According to this SO post, classifiers need integer or string labels.
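
One way to see how scikit-learn categorises a label array is the type_of_target helper in sklearn.utils.multiclass. The sketch below uses made-up values; floats with a fractional part are reported as continuous, which is the kind of target a classifier rejects, while integers and strings are reported as class labels:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

float_labels = type_of_target(np.array([0.0, 49.4, 47.4]))
int_labels = type_of_target(np.array([0, 49, 47]))
str_labels = type_of_target(np.array(["a", "b", "c"]))

print(float_labels)  # continuous
print(int_labels)    # multiclass
print(str_labels)    # multiclass
```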

You could consider switching to a regression model instead (that might better suit your data, as each datum appears to be a float), like so:

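from sklearn.ensemble import RandomForestRegressor

X_train = train.drop('P1', axis=1)
Y_train = train['P1']
rf = RandomForestRegressor(n_estimators=100)
# as_matrix() was removed from pandas; fit accepts the DataFrame and Series directly
rf.fit(X_train, Y_train)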
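
For a version that can be run end to end, here is a small sketch on synthetic data. The column names are copied from the question, but the values and the relationship between P1 and Tp1 are invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the training set (not the asker's data)
train = pd.DataFrame(rng.normal(size=(200, 4)),
                     columns=["Tp1", "IrrPOA", "Gz", "Drz2"])
train["P1"] = 10 * train["Tp1"] + rng.normal(scale=0.1, size=200)

X_train = train.drop("P1", axis=1)
Y_train = train["P1"]   # a 1D float target is fine for a regressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)

preds = rf.predict(X_train.head())   # predictions for the first five rows
print(preds.shape)
```

The regressor takes the float target as it is, so no label conversion is needed.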

answered Dec 12 '15 at 23:52

Nelewout


Thanks but no difference – Dragonfly Dec 12 '15 at 23:58
From what I understand of the complete error report, the error comes from the line where I call the 'fit' function. From the report: rf.fit(X_train, Y_train) File "C:\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 235, in fit y, expanded_class_weight = self._validate_y_class_weight(y) ... – Dragonfly Dec 13 '15 at 00:06
@Dragonfly, I'm terribly sorry for taking so long, but I hope this answers your question. – Nelewout Dec 13 '15 at 00:35
I think even strings can be used instead of floats. – Gurupad Hegde Dec 13 '15 at 00:35
@GurupadHedge, yes they can! But since this data is just floats, a regression solution _might_ be a more sensible option. – Nelewout Dec 13 '15 at 00:36

Looking into the problem, all fields in the training set are floats, so I think **regression** would be more useful than a classifier. – Gurupad Hegde Dec 13 '15 at 00:38

Regression works fine! Thanks a lot ! I used RandomForestRegressor instead of ExtraTreesRegressor. I will try to figure out the difference between those two – Dragonfly Dec 13 '15 at 01:49

RunD.M.C. · Answer 3 · 2017-12-08T21:46:16.123

1

I may be a tad late to the party, but I just got this error and solved it by making sure my y variable was of type int, using

 y = df['y_variable'].astype(int)

before doing a train/test split. Also, as others have said, your problem seems a better fit for a RandomForestRegressor than a RandomForestClassifier.
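
One thing worth checking before applying the cast to a target like P1: astype(int) drops the fractional part, so nearby float values collapse into the same integer class. A minimal sketch with made-up values:

```python
import pandas as pd

p1 = pd.Series([0.0, 49.4, 47.4, 47.9])   # made-up float target

as_int = p1.astype(int)   # fractional part is discarded, not rounded
print(as_int.tolist())    # [0, 49, 47, 47]
```

Here 47.4 and 47.9 both become 47, so the cast suits targets that are really categories stored as floats rather than continuous measurements.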

edited Dec 08 '17 at 21:46

answered Dec 07 '17 at 19:04

RunD.M.C.


'y_variable' spelling – JDOaktown Dec 07 '17 at 19:25
