Sklearn: How to Feed Data to sklearn RandomForestClassifier

Question

I have this data:

print training_data
print labels

# prints

[[1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 1, 1]]
['a', 'b', 'a', 'b', 'a', 'b', 'b', 'a', 'a', 'a', 'b']

I am trying to feed it to a RandomForestClassifier from the sklearn Python library:

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(training_data, labels)

But receive this error:

Traceback (most recent call last):
  File "learn.py", line 52, in <module>
    main()
  File "learn.py", line 48, in main
    classifier = train_classifier()
  File "learn.py", line 33, in train_classifier
    classifier.fit(training_data, labels)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 348, in fit
    y = np.ascontiguousarray(y, dtype=DOUBLE)
  File "/Library/Python/2.7/site-packages/numpy-1.8.0.dev_bbcfcf6_20130307-py2.7-macosx-10.8-intel.egg/numpy/core/numeric.py", line 419, in ascontiguousarray
    return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: could not convert string to float: a

My guess is that I am not formatting this data correctly for fitting, but I cannot work out why from the documentation.

This seems like a pretty basic, simple issue. Does anyone know the answer?

Wild guess: try numerical values, e.g. `0/1` instead of `'a'/'b'`. — Matt, Apr 07 '13 at 19:35
OK, I will, but that would be a major disappointment, since for decision trees the labels need not be numeric. I can't imagine the sklearn authors would do that. — David Williams, Apr 07 '13 at 19:36
possible duplicate of [Non-Integer Class Labels Scikit-Learn](http://stackoverflow.com/questions/13300160/non-integer-class-labels-scikit-learn) — BrenBarn, Apr 07 '13 at 19:45

Matt · Answer 1 · 2013-04-07T19:54:22.177

7

Try transforming your labels to integers beforehand using the LabelEncoder from sklearn.preprocessing.
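A minimal sketch of what that could look like with the data from the question. The variable names `encoder`, `encoded_labels`, `predicted` and `predicted_labels` are only illustrative, and `random_state=0` is an added assumption so the run is repeatable:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Data copied from the question.
training_data = [[1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0],
                 [1, 1, 0, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1],
                 [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 1, 1]]
labels = ['a', 'b', 'a', 'b', 'a', 'b', 'b', 'a', 'a', 'a', 'b']

# Map the string labels to integers; classes are sorted, so 'a' -> 0, 'b' -> 1.
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

# random_state is not in the question; it is set here only for repeatability.
classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(training_data, encoded_labels)

# Predictions come back as integers; map them back to the original strings.
predicted = classifier.predict([[1, 0, 1, 1]])
predicted_labels = encoder.inverse_transform(predicted)
print(predicted_labels)
```

Keeping the fitted encoder around matters: `inverse_transform` is what turns the classifier's integer output back into `'a'`/`'b'`.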

edited Apr 07 '13 at 19:54

answered Apr 07 '13 at 19:44


score 0 · Answer 2 · answered May 27 '15 at 16:01

You could use NumPy arrays, which the classifier recognises automatically, as below:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
np_training = np.array(training_data)
np_labels = np.array(labels)

clf = RandomForestClassifier(n_estimators=20, max_depth=5)
clf.fit(np_training, np_labels)

That should work.
