I am using this dataset:

https://www.kaggle.com/shahir/protein-data-set

Summary

I am struggling to create a preprocessing pipeline that combines built-in and custom transformers, including one that adds new attributes to the data and then performs further transformations on those added attributes.

Examples of additional attributes:

  • There is a phValue attribute that has missing data. I would like to create an additional phLabel column that labels each phValue as Acid, Neutral, or Base.
  • Also, the string length of each sequence feature, as a sequence_length column.

This would require imputing the missing values of phValue, then creating the additional attributes, and then further transformers that also transform the new sequence_length attribute.
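As a sketch of what I mean, on dummy rows (the ph_labels helper and the cutoff at 7 are my own assumptions, not from the dataset):

```python
import pandas as pd

# hypothetical helper: bins an already-imputed phValue into three labels
def ph_labels(ph):
    if ph < 7:
        return 'Acid'
    elif ph == 7:
        return 'Neutral'
    return 'Base'

# dummy rows standing in for the Kaggle data
df = pd.DataFrame({'phValue': [4.5, 7.0, 9.2],
                   'sequence': ['MKT', 'GAVL', 'M']})
df['phLabel'] = df['phValue'].apply(ph_labels)
df['sequence_length'] = df['sequence'].str.len()
```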

My terrible transformer.

This is an example of how I create my custom transformers. It works for manual preprocessing, but it is not the right approach when building a full preprocessing pipeline.

def data_to_frame(X):
    # Rebuild a DataFrame using the globally stored index and column names.
    if isinstance(X, pd.DataFrame):
        return X
    elif isinstance(X, sparse.csr_matrix):
        # a sparse matrix must be densified before pd.DataFrame accepts it
        return pd.DataFrame(X.toarray(), index=indices, columns=attributes)
    elif isinstance(X, np.ndarray):
        return pd.DataFrame(X, index=indices, columns=attributes)
    else:
        raise TypeError("Incorrect data structure passed")

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, no_difference=True):  # no *args or **kwargs
        self.no_difference = no_difference
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        # side effect: extends the global column list on every call
        attributes.extend(['sequence_length', 'difference', 'phLabel'])
        sequence_length = X.sequence.str.len()
        difference = X['residueCount'] - sequence_length
        phLabel = X['phValue'].apply(ph_labels)
        if self.no_difference:
            attributes.append('no_difference')
            no_difference = (difference == 0)
            return np.c_[X, sequence_length, difference, phLabel, no_difference]
        else:
            return np.c_[X, sequence_length, difference, phLabel]
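For comparison, the same transformer could be written without globals by accepting and returning a DataFrame directly and letting pandas track the column names. This is only a sketch; ph_labels is a stand-in defined here so the example runs, and the class name is made up:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

def ph_labels(ph):  # stand-in for the question's helper
    return 'Acid' if ph < 7 else ('Neutral' if ph == 7 else 'Base')

class PandasAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, no_difference=True):
        self.no_difference = no_difference
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()  # don't mutate the caller's frame
        X['sequence_length'] = X['sequence'].str.len()
        X['difference'] = X['residueCount'] - X['sequence_length']
        X['phLabel'] = X['phValue'].apply(ph_labels)
        if self.no_difference:
            X['no_difference'] = X['difference'] == 0
        return X

df = pd.DataFrame({'sequence': ['MKT', 'GA'],
                   'residueCount': [3, 5],
                   'phValue': [6.0, 8.0]})
out = PandasAttributesAdder().fit_transform(df)
```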

Pandas Operations in Transformers.

The operations I want to perform in the transformers are pandas-specific. My workaround was to convert the input NumPy array to a DataFrame inside transform and return the result as a NumPy array, using global variables for the attribute and index names. I realize this is a lackluster approach. How can I use pandas operations in my custom transformers?

I came across this blog post, but I was not able to make it work with ColumnTransformer: https://zablo.net/blog/post/pandas-dataframe-in-scikit-learn-feature-union/

Update:

Other issues with my pipeline: how do subsequent transformers work when specifying the columns to transform? Does it pass the whole set to each transformer, operate on the specified columns, and return the modified full set to the next transformer? Also, not specifying columns for my custom transformers seems to raise an error, even though the column arguments are not used there, as I pass them to the constructor. How should I alter my code?
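From what I can tell from the ColumnTransformer docs, the answer to the first part is: each transformer receives only its listed columns, the branches run independently (not in sequence), and their outputs are concatenated side by side; columns not listed anywhere are dropped by default (remainder='drop'), and a column listed under several transformers appears multiple times in the output. A dummy check:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'num': [1.0, 2.0, 3.0], 'cat': ['a', 'b', 'a']})

ct = make_column_transformer(
    (StandardScaler(), ['num']),   # branch 1: sees only 'num'
    (OneHotEncoder(), ['cat']),    # branch 2: sees only 'cat'
)
# 1 scaled column + 2 one-hot columns, concatenated horizontally
out = ct.fit_transform(df)
```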

If I comment out OrdinalEncoder and OneHotEncoder, then after fit_transform the ColumnTransformer outputs a NumPy array of shape (rows, 72). There are 19 attributes and I drop 2 of them in the FeatureSelector transformer, so I would expect an array of shape (rows, 17) without one-hot encoding.

If I leave it as is, I receive: ValueError: Input contains NaN.

attributes is a global list of every column in my data set; in FeatureSelector I remove the columns I dropped.

# numeric_feat_eng + categ_feat_eng contains all of my attributes
preproc_pipeline = make_column_transformer(
    (SimpleImputer(strategy='mean'), numeric_feat_eng),
    (SimpleImputer(strategy='most_frequent'), categ_feat_eng),
    (FixAtributeValues(), attributes),
    (CombinedAttributesAdder(), attributes),
    (FeatureSelector(attributes_to_drop), attributes_to_drop),
    (LogTransformation(atr_log_trans), atr_log_trans),
    (StandardScaler(), numeric_feat_eng),
    (OrdinalEncoder(), id_cols),
    (OneHotEncoder(handle_unknown='ignore'), categ_without_ids)
)

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attributes_drop=['pdbxDetails', 'sequence']):
        self.attributes_drop = attributes_drop
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        X = data_to_frame(X)
        # side effect: keeps the global column list in sync with the drop
        for x in self.attributes_drop:
            attributes.remove(x)
        X = X.drop(columns=self.attributes_drop)
        return X

If anyone could guide me on how to do this, it would be very much appreciated! Or point me to sources where I could learn how to create pipelines.

Syrnik

1 Answer

This should work as expected; most likely there is something wrong with your implementation, so you may try working off a dummy dataset first. TransformerMixin does not really care whether the input is a NumPy array or a pandas DataFrame, and it will work as intended.

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline


class CustomTransformer(TransformerMixin):
    def __init__(self, some_stuff=None, column_names=[]):
        self.some_stuff = some_stuff
        self.column_names = column_names
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # do stuff on X, and return dataframe
        # of the same shape - this gets messy
        # if the preceding item is a numpy array
        # and not a dataframe
        if isinstance(X, np.ndarray):
            X = pd.DataFrame(X, columns=self.column_names)
        
        X['str_len'] = X['my_str'].apply(lambda x: str(x)).str.len()
        X['custom_func'] = X['val'].apply(lambda x: 1 if x > 0.5 else -1)
        return X


df = pd.DataFrame({
    'my_str': [111, 2, 3333],
    'val': [0, 1, 1]
})

# mixing this works as expected
my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)

# using this by itself works as well
my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)

Output is:

In [  ]: my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"])) 
    ...: my_pipeline.fit_transform(df)                                                                                                                                                                                                  
Out[  ]: 
     my_str       val  str_len  custom_func
0 -0.671543 -1.414214       19           -1
1 -0.742084  0.707107       18            1
2  1.413627  0.707107       17            1

In [  ]: my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"])) 
    ...: my_pipeline.fit_transform(df)                                                                                                                                                                                                  
Out[  ]: 
   my_str  val  str_len  custom_func
0     111    0        3           -1
1       2    1        1            1
2    3333    1        4            1
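A note on the ColumnTransformer in your question: its branches run in parallel on their column subsets and the results are concatenated, so steps that must run sequentially (impute, then derive, then scale) belong in a Pipeline, with the ColumnTransformer as one step. A minimal sketch with dummy data (step names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'phValue': [6.0, None, 8.0]})

# one ColumnTransformer holds the per-column (parallel) work
impute = ColumnTransformer([('imp', SimpleImputer(strategy='mean'), ['phValue'])])

# anything that must see the imputed output goes after it in the Pipeline
pipe = Pipeline([('impute', impute), ('scale', StandardScaler())])
res = pipe.fit_transform(df)  # no NaN reaches the scaler
```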

Alternatively, you can use sklearn-pandas if you want to map things to a DataFrame directly:

from sklearn_pandas import DataFrameMapper

# using sklearn-pandas
str_transformer = FunctionTransformer(lambda x: x.apply(lambda y: y.str.len()))
cust_transformer = FunctionTransformer(lambda x: (x > 0.5) *2 -1)


mapper = DataFrameMapper([
    (['my_str'], str_transformer),
    (['val'], make_pipeline(StandardScaler(), cust_transformer))
], input_df=True, df_out=True)

mapper.fit_transform(df)

Output:

In [  ]: mapper.fit_transform(df)                                                                                                                                                                                                       
Out[  ]: 
   my_str  val
0       3   -1
1       2    1
2       1    1

Using sklearn-pandas lets you be explicit about the input and output being DataFrames, and lets you map each column individually to the pipeline of interest rather than hard-coding the column names inside the TransformerMixin object.

chappers
  • Could you tell me how would you create a pipeline where you would need to specify a subset of the column names in each transformation? For Example a OneHotEncoder for my_str. – Syrnik Aug 25 '20 at 22:35