I am trying to build a pipeline using sklearn and am not sure how to go about it. Here is a minimal example:
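import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def numFeat(data):
    # select the numeric columns
    return data[['AGE', 'WASTGIRF']]

def catFeat(data):
    # one-hot encode the categorical columns
    return pd.get_dummies(data[['PAI', 'smokenow1']])

features = FeatureUnion([('f1', FunctionTransformer(numFeat)),
                         ('f2', FunctionTransformer(catFeat))])
pipeline = Pipeline([('f', features), ('lm', LinearRegression())])

data = pd.DataFrame({'AGE': [1, 2, 3, 4],
                     'WASTGIRF': [23, 5, 43, 1],
                     'PAI': ['a', 'b', 'a', 'd'],
                     'smokenow1': ["lots", "some", "none", "some"]})

# y (not shown) is the target, one value per row of data
pipeline.fit(data, y)
print(pipeline.transform(data))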
In the above example, data is a Pandas DataFrame that contains the columns ['AGE', 'WASTGIRF', 'PAI', 'smokenow1'] among others.
Of course, in the FeatureUnion example, I want to supply many more transformation operations, but all of them take a Pandas DataFrame and return another Pandas DataFrame. So, in effect, I want to do something like this:
data --+-->num features-->num transforms--+-->FeatureUnion-->model
       |                                  |
       +-->cat features-->cat transforms--+
How do I go about doing this?
For the example above, the error I get is:
TypeError: float() argument must be a string or a number
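For reference, the layout in the diagram could be sketched roughly as follows. This is only a sketch under assumptions, not a confirmed fix for the error above: it assumes a scikit-learn version that provides ColumnTransformer (sklearn.compose) and OneHotEncoder with handle_unknown, it swaps pd.get_dummies for OneHotEncoder so the encoding is learned during fit, it uses StandardScaler as a stand-in for the "num transforms" step, and the y values are made up purely so the example can be fitted.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({'AGE': [1, 2, 3, 4],
                     'WASTGIRF': [23, 5, 43, 1],
                     'PAI': ['a', 'b', 'a', 'd'],
                     'smokenow1': ["lots", "some", "none", "some"]})
# Made-up target values (assumption), only so the sketch can be fitted.
y = [1.0, 2.0, 3.0, 4.0]

# Numeric branch: StandardScaler is a placeholder for "num transforms".
num_branch = Pipeline([('scale', StandardScaler())])
# Categorical branch: one-hot encoding learned at fit time.
cat_branch = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each branch only sees the columns listed for it; the outputs are
# concatenated side by side, much like FeatureUnion does.
features = ColumnTransformer([
    ('num', num_branch, ['AGE', 'WASTGIRF']),
    ('cat', cat_branch, ['PAI', 'smokenow1']),
])

pipeline = Pipeline([('f', features), ('lm', LinearRegression())])
pipeline.fit(data, y)

# The feature matrix comes from the fitted feature step; the full
# pipeline ends in a regressor, so it is used through predict().
transformed = pipeline.named_steps['f'].transform(data)
predictions = pipeline.predict(data)
```

Further per-branch steps could be appended to num_branch or cat_branch as extra (name, transformer) pairs, which keeps column selection separate from the transforms themselves.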