
I have a problem when I try to use categorical variables in a Pipeline. pd.get_dummies() is a terrific tool, but it cannot be used directly inside a Pipeline, so I had to use DictVectorizer(). I do it as below (toy example):

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn import metrics
from xgboost.sklearn import XGBRegressor

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': ['a', 'a', 'a', 'b', 'b', 'b']})

X = df[['b']]
y = df['a']

Then I build the pipeline:

class Cat():

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # A fresh DictVectorizer is fitted on every call, and on the
        # module-level df rather than on the X that is passed in.
        enc = DictVectorizer(sparse=False)
        encc = enc.fit(df[['b']].T.to_dict().values())
        enc_data = encc.transform(X.T.to_dict().values())
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
xgb = XGBRegressor()
pipeline = Pipeline([
    ('categorical', Cat()),
    ('model_fitting', xgb),
])
pipeline.fit(X, y)
metrics.r2_score(y, pipeline.predict(X))
0.9999985362431687

It works. Compare with pd.get_dummies():

X1 = pd.get_dummies(df['b'])
xgb.fit(X1, y)
metrics.r2_score(y, xgb.predict(X1))
0.9999985362431687
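
A quick sanity check on the toy data (a sketch continuing from the imports and df above; to_dict('records') is just a tidier spelling of the .T.to_dict().values() idiom) confirms that the two encodings themselves agree:

# Encode the same column both ways and compare element-wise.
dummies = pd.get_dummies(df['b']).values                      # one-hot via pandas
dv = DictVectorizer(sparse=False)
vectorized = dv.fit_transform(df[['b']].to_dict('records'))   # one-hot via DictVectorizer
print(np.array_equal(dummies, vectorized))                    # -> True here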

But the problem is that on the real data set the results of pd.get_dummies() and DictVectorizer() are dramatically different. The real data set has no NaNs and no empty cells. It has two variables: 1) y, which is numeric, and 2) a string 'gender' (f: 962, m: 140).

R^2 for pd.get_dummies(): 0.025946526223095123

R^2 for DictVectorizer(): 0.00170802695618677

The problem does not depend on the sample size, since I made

df = pd.DataFrame({'a': range(6000), 'b': ['а', 'м'] * 3000})

and the results are identical.

What could be the reason? Thanks for your help.

Edward
• Can't see anything wrong. I suggest you take out the complexity of the pipeline and just try the first step using get_dummies and DictVectorizer, and compare the two outputs. – simon Jan 21 '17 at 18:25
• I've done it; the datasets are identical. – Edward Jan 21 '17 at 18:32
  • 1
• And when you run fit and score on these identical datasets? – simon Jan 21 '17 at 18:50
• @simon Sorry, I've found the mistake: I get my real data set by filtering and I had to reset the index... I'm an idiot. – Edward Jan 21 '17 at 18:54
  • 1
• Could be because of the way you have defined Cat(). Normally, to be included in the pipeline, your classifier needs to inherit from BaseEstimator. – simon Jan 21 '17 at 18:58
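
For reference, here is a minimal sketch of what simon's last comment suggests: a transformer that inherits from BaseEstimator and TransformerMixin (which supplies fit_transform), and that fits its DictVectorizer on the X it receives rather than on the module-level df. The class name DictEncoder is illustrative, not from the original code.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

class DictEncoder(BaseEstimator, TransformerMixin):
    """Pipeline-friendly one-hot encoding of string columns via DictVectorizer."""

    def fit(self, X, y=None):
        self.enc_ = DictVectorizer(sparse=False)
        # Fit on the data actually passed in, not on a module-level df.
        self.enc_.fit(X.to_dict('records'))
        return self

    def transform(self, X):
        # 'records' yields one dict per row, in row order.
        return self.enc_.transform(X.to_dict('records'))

It drops in as Pipeline([('categorical', DictEncoder()), ('model_fitting', XGBRegressor())]). As for the mistake Edward found: .T.to_dict() keys the rows by their index, and since plain-dict ordering is not guaranteed on older Pythons, a filtered DataFrame with a non-contiguous index can hand DictVectorizer its rows in an order that no longer matches y. Resetting the index after filtering, e.g. df = df[mask].reset_index(drop=True) (with mask standing in for whatever filter produced the real data set), avoids this.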

0 Answers