I am looking to run classification on a column that has a few possible values, but I want to consolidate them into fewer labels.
For example, a job may have multiple end states: success, fail, error, killed. But I am looking to classify the jobs into just two groups of end states: one group that includes error and killed, and another group that includes only success and fail.
I cannot find a way to do that in sklearn's LabelEncoder, and other than manually changing the target column myself (by assigning 1 to success or fail and 0 to everything else) I cannot find a way.
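For reference, the manual workaround described above could look roughly like the sketch below. It is plain Python rather than anything from sklearn, and the names consolidate and POSITIVE_STATES are made up for illustration.

```python
# Hypothetical set of end states that should collapse into label 1;
# every other end state becomes label 0.
POSITIVE_STATES = frozenset({'success', 'fail'})


def consolidate(labels, positive=POSITIVE_STATES):
    """Map each raw end state to 1 if it is in `positive`, else 0."""
    return [1 if label in positive else 0 for label in labels]


end_states = ['success', 'fail', 'error', 'killed', 'success']
print(consolidate(end_states))  # prints [1, 1, 0, 0, 1]
```

The resulting list has one binary label per job, which is the shape I am after, but it means maintaining the mapping by hand outside of sklearn.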
EDIT: an example. This is what I need to happen (the desired output, not what the call actually returns):
>>> label_binarize(['success','fail','error','killed', 'success'], classes=(['success', 'fail']))
array([[1],
       [1],
       [0],
       [0],
       [1]])
Unfortunately, label_binarize (or LabelBinarizer, for that matter) binarizes each class separately and produces one column per class. This is NOT what I want:
>>> label_binarize(['success','fail','error','killed', 'success'], classes=['success', 'fail'])
array([[1, 0],
       [0, 1],
       [0, 0],
       [0, 0],
       [1, 0]])
Any ideas on how to do that?