4

I am trying to use the ExtraTreesClassifier in scikit-learn on my data. I have two numpy arrays X and y. X is of dimension (10000,51) and y is (10000,). To make sure they are in numpy array format, I use

import numpy as np

X = np.array(X, dtype=np.float32)
print(np.asarray(X, dtype=np.float32) is X)
y = np.array(y, dtype=np.float32)
print(np.asarray(y, dtype=np.float32) is y)

and I get True for both. Then I define my model as:

from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0, n_jobs=-1)

And when I want to train my model using

clf = clf.fit(X, y)

I get the following error:

File "CFD_scikit_learn.py", line 169, in <module>
clf = Xtra_Trees(my_var)
  File "CFD_scikit_learn.py", line 140, in Xtra_Trees
clf = clf.fit(X, y)
  File "/user/leuven/308/vsc30879/.local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 235, in fit
y, expanded_class_weight = self._validate_y_class_weight(y)
  File "/user/leuven/308/vsc30879/.local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 421, in _validate_y_class_weight
check_classification_targets(y)
  File "/user/leuven/308/vsc30879/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 173, in check_classification_targets
raise ValueError("Unknown label type: %r" % y)
ValueError: Unknown label type: array([[ 2.09895 ],
   [ 1.658568],
   [ 1.242831],
   ..., 
   [ 1.743349],
   [ 1.765763],
   [ 1.824112]])

If anybody knows how to solve this problem, I would be grateful if you let me know.

Vahid S. Bokharaie

3 Answers

7

Classifiers need discrete class labels (typically integers), not continuous floats.

You either need to turn them into integers (e.g. bin them), or use a regression-type model.
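If the targets really are continuous, the regression route could look roughly like the sketch below. The data here is synthetic and only mimics the 51-feature shape from the question; ExtraTreesRegressor is the regression counterpart of the classifier in scikit-learn, but whether it suits the asker's problem is an assumption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the asker's data (assumption): 51 features, float targets.
rng = np.random.RandomState(0)
X = rng.rand(200, 51).astype(np.float32)
y = (X[:, 0] + 0.1 * rng.rand(200)).astype(np.float32)

# A regressor accepts continuous targets directly, so no binning is needed.
reg = ExtraTreesRegressor(n_estimators=10, random_state=0)
reg.fit(X, y)

# Predictions are floats, one per input row.
pred = reg.predict(X[:5])
print(pred.shape)
```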

If you think you can bin the floats into sensible classes, numpy.digitize might help. Or you could binarize them.
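As a rough illustration of the binning idea, the sketch below applies numpy.digitize to the six values visible in the error message. The bin edges (1.5 and 2.0) are made up for the example; sensible edges depend on what the floats mean.

```python
import numpy as np

# The values shown in the traceback, shaped (n, 1) like the reported array.
y = np.array([[2.09895], [1.658568], [1.242831],
              [1.743349], [1.765763], [1.824112]], dtype=np.float32)

# Illustrative bin edges (assumption): pick edges that make sense for your data.
edges = np.array([1.5, 2.0])

# ravel() flattens (n, 1) to (n,); digitize then maps each value to a bin index:
# 0 for y < 1.5, 1 for 1.5 <= y < 2.0, 2 for y >= 2.0.
y_binned = np.digitize(y.ravel(), edges)

print(y_binned)  # [2 1 0 1 1 1]
```

The resulting integer array can be passed as y to clf.fit in place of the float targets.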

Matt Hall
2

y should be an array of integers instead of floats. Each integer should represent a class.
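One way to check how scikit-learn interprets a target array before fitting is type_of_target from sklearn.utils.multiclass, which lives in the same module as the check that raised the error. The arrays below are small made-up examples, not the asker's data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Float targets with non-integer values are treated as regression targets.
y_float = np.array([2.09895, 1.658568, 1.242831], dtype=np.float32)

# Integer targets with more than two distinct values are treated as class labels.
y_int = np.array([2, 1, 0, 1])

print(type_of_target(y_float))  # continuous
print(type_of_target(y_int))    # multiclass
```

A result of "continuous" means a classifier will reject the array with the "Unknown label type" error.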

Ibraim Ganiev
0

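Another way is to convert the labels to fixed-width strings:

y = numpy.array(y, dtype='|Sx') where x stands for the number of characters required to represent your float number. Note that every distinct float value then becomes its own class.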

zina