I am trying to generate a pipeline using sklearn, and am not really sure how to go about it. Here is a minimal example:

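import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def numFeat(data):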
    return data[['AGE', 'WASTGIRF']]

def catFeat(data):
    return pd.get_dummies(data[['PAI', 'smokenow1']])

features = FeatureUnion([('f1',FunctionTransformer(numFeat)),
                         ('f2',FunctionTransformer(catFeat))  ] )

pipeline = Pipeline( [('f', features), ('lm',LinearRegression())] )

data = pd.DataFrame({'AGE':[1,2,3,4], 
                     'WASTGIRF': [23,5,43,1], 
                     'PAI':['a','b','a','d'], 
                     'smokenow1': ["lots", "some", "none", "some"]})

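y = [1.0, 2.0, 3.0, 4.0]  # placeholder target; the original post does not define y

pipeline.fit(data, y)
print(pipeline.predict(data))  # a Pipeline ending in a regressor has predict, not transform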

In the above example, data is a Pandas DataFrame that contains the columns ['AGE', 'WASTGIRF', 'PAI', 'smokenow1'] among others.

Of course, in the FeatureUnion example, I want to supply many more transformation operations, but all of them take a Pandas DataFrame and return another Pandas DataFrame. So in effect, I want to do something like this:

data --+-->num features-->num transforms--+-->FeatureUnion-->model
       |                                  |
       +-->cat features-->cat transforms--+

How do I go about doing this?

For the example above, the error I get is:

TypeError: float() argument must be a string or a number

1 Answer

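You need to initialise FunctionTransformer with validate=False (IMO validate=True is a bad default that should be changed). With validation on, the input is checked and converted to a numeric array before your function is called, which fails on the string columns and raises the TypeError: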

features = FeatureUnion([('f1',FunctionTransformer(numFeat, validate=False)),
                         ('f2',FunctionTransformer(catFeat, validate=False))] )
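
For completeness, here is a minimal end-to-end sketch of the question's example with that change applied. The target y is a made-up placeholder (the question never defines it), and the final call uses predict because a pipeline that ends in LinearRegression has no transform step.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def numFeat(data):
    # Select the numeric columns unchanged.
    return data[['AGE', 'WASTGIRF']]


def catFeat(data):
    # One-hot encode the string-valued columns.
    return pd.get_dummies(data[['PAI', 'smokenow1']])


# validate=False hands the DataFrame to each function as-is.
features = FeatureUnion([
    ('f1', FunctionTransformer(numFeat, validate=False)),
    ('f2', FunctionTransformer(catFeat, validate=False)),
])
pipeline = Pipeline([('f', features), ('lm', LinearRegression())])

data = pd.DataFrame({'AGE': [1, 2, 3, 4],
                     'WASTGIRF': [23, 5, 43, 1],
                     'PAI': ['a', 'b', 'a', 'd'],
                     'smokenow1': ['lots', 'some', 'none', 'some']})
y = [1.0, 2.0, 3.0, 4.0]  # placeholder target, not from the question

pipeline.fit(data, y)
preds = pipeline.predict(data)

# Inspect the combined feature matrix produced by the union step.
X = pipeline.named_steps['f'].transform(data)
print(X.shape)
print(preds)
```

As far as I know, recent scikit-learn releases (0.22 onward) already default to validate=False, so passing it explicitly mainly matters on older versions. Also note that pd.get_dummies learns nothing at fit time, so new data with different category values will yield different columns.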

See also sklearn pipeline - how to apply different transformations on different columns
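
As a rough sketch of that per-column approach, ColumnTransformer (in sklearn.compose since 0.20) routes column subsets to different transformers. This is an alternative technique, not what the code above does. StandardScaler is only a stand-in for the "num transforms" in the question, and y is again a made-up placeholder.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({'AGE': [1, 2, 3, 4],
                     'WASTGIRF': [23, 5, 43, 1],
                     'PAI': ['a', 'b', 'a', 'd'],
                     'smokenow1': ['lots', 'some', 'none', 'some']})
y = [1.0, 2.0, 3.0, 4.0]  # placeholder target, not from the question

# Each entry is (name, transformer, columns it applies to).
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['AGE', 'WASTGIRF']),  # stand-in numeric transform
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['PAI', 'smokenow1']),
])

model = Pipeline([('prep', preprocess), ('lm', LinearRegression())])
model.fit(data, y)
preds_ct = model.predict(data)

# The fitted preprocessing step can be reused on its own.
X_ct = model.named_steps['prep'].transform(data)
print(X_ct.shape)
```

Unlike pd.get_dummies, OneHotEncoder remembers the categories seen during fit, so transform yields the same columns on new data.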
