Sklearn Transform Different Columns Differently In Pipeline - Ex: X[col1] gets tfidf, X[col2] gets label encoding?

Question

I'm trying to understand how to apply different transformations to different columns. I know I need a Pipeline, and I think I also need a FeatureUnion.

My Dataframe:

               text  labels    pred
0  this is a phrase   green  0.0134
1        so is this    blue  0.0231
2       this is too   green  0.0321
3     and i am done  yellow  0.0123

My Sample Code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import TransformerMixin

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
           'labels': ['green', 'blue', 'green', 'yellow'],
           'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
          columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

pipeline = Pipeline(steps=[
  ('union', FeatureUnion(
    transformer_list=[
      ('bagofwords', Pipeline([
        # X['text'] processed here
        ('tfidf', TfidfVectorizer()),
        ])),
      ('encoder', Pipeline([
        # X['labels'] processed here
        ('le', LabelEncoder()), 
        ]))
      ])
   ),
  # join above steps back into single X and pass to LinearRegression??
  ('lr', LinearRegression()),
  ])

pipeline.fit(X, y)

If FeatureUnion is the solution, how do I tell the pipeline to use tfidf for X['text'] and LabelEncoder for X['labels'], then combine the results and send them to LinearRegression?

Do I need custom transformers? If so, how would that work in this scenario?
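
One possible way to wire this up, sketched under a few assumptions that are not in the code above: scikit-learn 0.20 or later is installed (so sklearn.compose.ColumnTransformer is available), and OrdinalEncoder is swapped in for LabelEncoder, because LabelEncoder is meant for the target y and its fit_transform does not accept the (X, y) call a Pipeline makes. The step name 'preprocess' is only illustrative. Under those assumptions no custom transformer appears to be needed:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

# Route each column to its own transformer and concatenate the results.
preprocess = ColumnTransformer(transformers=[
    # A plain string selects the column as 1-D, which TfidfVectorizer expects.
    ('bagofwords', TfidfVectorizer(), 'text'),
    # A list selects the column as 2-D, which OrdinalEncoder expects.
    # Assumption: OrdinalEncoder stands in for LabelEncoder on a feature column.
    ('encoder', OrdinalEncoder(), ['labels']),
])

pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)

# Inspect the combined feature matrix and the fitted model's predictions.
features = pipeline.named_steps['preprocess'].transform(X)
predictions = pipeline.predict(X)
```

Integer codes from OrdinalEncoder imply an ordering between colours that a linear model will take literally; if that is not wanted, OneHotEncoder can be dropped into the same slot.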
