I am getting a strange error that I cannot understand. I have the following data:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper
test = pd.DataFrame({"a": ['a', 'c', '-', '9', 'c', 'a', 'a', 'c', 'b', 'i', 'c', 'r'],
                     "b": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]})
Then I create a DataFrameMapper:
Mapper = DataFrameMapper([ ('a', LabelEncoder()) ])
Then I build a Pipeline and fit it:
pipeline = Pipeline([('featurize', Mapper),('forest',RandomForestClassifier())])
X = test[test.columns.drop('b')]
y = test['b']
model = pipeline.fit(X = X, y = y)
Everything works fine, and I can predict with this model. But when I run cross_val_score
cross_val_score(pipeline, X, y, scoring='accuracy', cv=2)
it raises an error:
a: y contains new labels: ['-' '9']
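As far as I can tell, the same kind of error can be reproduced without the pipeline whenever a LabelEncoder is fitted on only part of the column and then asked to transform the rest. The sketch below is only an illustration: the split into train_part and held_out_part is a hypothetical one that I chose by hand, not necessarily the folds that cv=2 produces.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = np.array(['a', 'c', '-', '9', 'c', 'a', 'a', 'c', 'b', 'i', 'c', 'r'])

# Hypothetical split (an assumption for illustration): treat the last 8 values
# as a training fold and the first 4 as the held-out fold.
train_part, held_out_part = values[4:], values[:4]

# The encoder only learns the labels present in the training part.
enc = LabelEncoder().fit(train_part)

try:
    # '-' and '9' never appeared in train_part, so transform() fails.
    enc.transform(held_out_part)
    unseen_error = None
except ValueError as exc:
    unseen_error = str(exc)

print(unseen_error)
```

The exact wording of the message seems to depend on the scikit-learn version, but in each case it is a ValueError about labels the encoder has not seen.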
How can I avoid this, and why does it work this way? I thought that LabelEncoder is fitted on the data first and that cross-validation runs afterwards. I have also tried to fit the encoder first
enc = LabelEncoder()
enc.fit(test['a'])
on the entire column and then pass it to the Mapper, but that doesn't work either.
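One possible explanation for why pre-fitting has no effect: cross_val_score clones the estimator before fitting it on each fold, and sklearn.base.clone returns an unfitted copy with the same parameters. I have not verified how DataFrameMapper behaves here, so this is a minimal sketch with a bare LabelEncoder only:

```python
from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on every distinct label in the column.
enc = LabelEncoder()
enc.fit(['a', 'c', '-', '9', 'b', 'i', 'r'])

# clone() builds a new estimator from the constructor parameters only,
# so nothing learned during fit() is carried over.
enc_copy = clone(enc)

fitted_before = hasattr(enc, 'classes_')      # original keeps its classes_
fitted_after = hasattr(enc_copy, 'classes_')  # the clone has no classes_

print(fitted_before, fitted_after)  # prints: True False
```

If the same thing happens to the encoder inside the Mapper, each fold would start from an empty encoder and refit it on the training rows alone, which would be consistent with the error above.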