I'm trying to understand how to perform different transformations for different columns. I know I need a Pipeline, and I think I also need a FeatureUnion.
My DataFrame:
               text  labels    pred
0  this is a phrase   green  0.0134
1        so is this    blue  0.0231
2       this is too   green  0.0321
3     and i am done  yellow  0.0123
My Sample Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import TransformerMixin

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

pipeline = Pipeline(steps=[
    ('union', FeatureUnion(
        transformer_list=[
            ('bagofwords', Pipeline([
                # X['text'] processed here
                ('tfidf', TfidfVectorizer()),
            ])),
            ('encoder', Pipeline([
                # X['labels'] processed here
                ('le', LabelEncoder()),
            ]))
        ])
    ),
    # join above steps back into single X and pass to LinearRegression??
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
If FeatureUnion is the solution, how do I tell the pipeline to use the TfidfVectorizer for X['text'] and the LabelEncoder for X['labels'], then combine the results and send them to LinearRegression?
Do I need custom transformers? If so, how would that work in this scenario?
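For anyone comparing approaches, here is a minimal sketch of one possible direction, not a confirmed answer. It makes two substitutions that are not in my code above, and both are assumptions. First, it uses ColumnTransformer (available in scikit-learn 0.20 and later) in place of FeatureUnion, because it routes named columns to transformers without a custom selector class. Second, it uses OneHotEncoder in place of LabelEncoder, because LabelEncoder is designed for target values: its fit method accepts a single argument, so it raises a TypeError when a Pipeline calls it with both X and y. The step names 'preprocess', 'bagofwords' and 'encoder' are only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
    'labels': ['green', 'blue', 'green', 'yellow'],
    'pred': [0.0134, 0.0231, 0.0321, 0.0123],
})
X = df[['text', 'labels']]
y = df['pred']

preprocess = ColumnTransformer(transformers=[
    # A plain string selects the column as 1-D data, which TfidfVectorizer expects.
    ('bagofwords', TfidfVectorizer(), 'text'),
    # A list selects a 2-D frame, which OneHotEncoder expects.
    # Assumption: one-hot encoding stands in for LabelEncoder here.
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['labels']),
])

pipeline = Pipeline(steps=[
    # ColumnTransformer stacks the outputs of both transformers side by side.
    ('preprocess', preprocess),
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

With this layout no custom transformer is needed. If FeatureUnion is a hard requirement, each branch would likely need its own column-selecting step placed before the vectorizer or encoder, which is where a custom TransformerMixin subclass would come in.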