I am getting a strange error that I cannot understand. I have the following data:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper

test = pd.DataFrame({"a": ['a','c','-','9','c','a','a','c','b','i','c','r'],
                     "b": [0,0,1,0,0,1, 0,0,1,0,0,1] })

Then I create a DataFrameMapper:

Mapper = DataFrameMapper([ ('a', LabelEncoder()) ])

Then a Pipeline:

pipeline = Pipeline([('featurize', Mapper), ('forest', RandomForestClassifier())])
X = test[test.columns.drop('b')]
y = test['b']
model = pipeline.fit(X=X, y=y)

Everything works fine, and I can predict with this model. But when I call cross_val_score:

cross_val_score(pipeline, X, y, 'accuracy', cv=2)

it raises an error:

a: y contains new labels: ['-' '9']

How can I avoid this, and why does it work this way? I thought the LabelEncoder would be fitted on the data first, and only then would cross-validation run. I have tried fitting the encoder first

enc = LabelEncoder()
enc.fit(test['a'])

on the entire column and then inserting it into the Mapper, but it doesn't work.
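
For illustration, here is a minimal sketch of what seems to be going on, using LabelEncoder directly rather than through DataFrameMapper. The half-and-half split is hand-picked for the example and is only an assumption about how the folds fall; the folds that cross_val_score actually chooses may differ. It shows two things: an encoder fitted on one part of the column rejects labels that occur only in the other part, and sklearn.base.clone returns an unfitted copy, so an encoder fitted beforehand does not keep its state once the pipeline is cloned and refitted for each fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder

a = np.array(['a', 'c', '-', '9', 'c', 'a', 'a', 'c', 'b', 'i', 'c', 'r'])

# Hand-picked split (an assumption for illustration, not the real CV folds).
held_out, train = a[:6], a[6:]

# Fit on the training part only, as happens inside each CV iteration.
enc = LabelEncoder().fit(train)

# Labels that occur only in the held-out part are unknown to the encoder.
unseen = sorted(set(held_out.tolist()) - set(enc.classes_.tolist()))
print(unseen)

# Transforming the held-out part fails because of those unknown labels.
try:
    enc.transform(held_out)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# clone() copies the constructor parameters but not the fitted state.
fresh = clone(enc)
print(hasattr(enc, 'classes_'), hasattr(fresh, 'classes_'))
```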

Shin
  • Inside cross_val_score, everything will be cloned, so your fitting outside will not work. You need to replace the `a` column with the label encoded data before sending to cross_val_score – Vivek Kumar Dec 21 '17 at 12:47
  • @VivekKumar That's what the mapper is doing (replacing column a), and this step is first in the pipeline, no? – Shin Dec 21 '17 at 13:25
  • @VivekKumar Maybe I've got it: cross_val_score splits the data first, then runs the pipeline. But the authors of this page use cross_val_score without any trouble: https://github.com/scikit-learn-contrib/sklearn-pandas – Shin Dec 21 '17 at 13:36
  • Then you need to use cross_val_score from sklearn-pandas. Currently you are using it from sklearn. But I'm not sure if that would help. – Vivek Kumar Dec 21 '17 at 14:20
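
The suggestion in the first comment can be sketched roughly as follows: encode column a once over the whole frame, then cross-validate only the classifier. This is a sketch under assumptions, not a tested fix for the original setup. It imports cross_val_score from sklearn.model_selection (the sklearn.cross_validation module used in the question is no longer available in newer releases), passes scoring by keyword, skips DataFrameMapper entirely, and sets random_state only for repeatability.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

test = pd.DataFrame({"a": ['a', 'c', '-', '9', 'c', 'a', 'a', 'c', 'b', 'i', 'c', 'r'],
                     "b": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]})

# Encode the whole column once, so every label is known before any split.
X = pd.DataFrame({"a": LabelEncoder().fit_transform(test["a"])})
y = test["b"]

# Only the classifier is refitted per fold; the encoding is already fixed.
forest = RandomForestClassifier(random_state=0)  # random_state is an assumption
scores = cross_val_score(forest, X, y, scoring="accuracy", cv=2)
print(scores)
```

Because the encoding is only a fixed label-to-integer mapping, doing it before the split does not leak target information. The integer codes do impose an arbitrary order on the categories, though, so whether that suits the model is a separate question.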
