I am using this dataset:
https://www.kaggle.com/shahir/protein-data-set
Summary
I am struggling to create a preprocessing pipeline that combines built-in and custom transformers, including one that adds additional attributes to the data and then performs further transformations on those added attributes as well.
Examples of additional attributes:
- There is a phValue attribute that has missing data. I would like to try creating an additional attribute that labels the phValue as (Acid, Neutral, Base) in a phLabel column.
- Similarly, the string length of each sequence feature (a sequence_length column).
This would require imputing the missing values of phValue first, then creating the additional attributes, and then applying further transformers that would also transform the sequence_length attribute.
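For reference, the ph_labels helper used below is a simple threshold mapping along these lines (the exact cutoffs are just my choice):

import pandas as pd

def ph_labels(ph):
    # Coarse pH label; missing values stay missing until imputed.
    if pd.isna(ph):
        return ph
    if ph < 7:
        return 'Acid'
    if ph > 7:
        return 'Base'
    return 'Neutral'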
My terrible transformer.
This is an example of how I create my custom transformers that I could use for manual preprocessing; however, this is not the right way to approach it when building a full preprocessing pipeline.
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

def data_to_frame(X):
    # Rebuild a DataFrame from whatever the previous step produced,
    # relying on the global `indices` and `attributes` lists.
    if isinstance(X, pd.DataFrame):
        return X
    elif isinstance(X, sparse.csr_matrix):
        return pd.DataFrame(X.toarray(), index=indices, columns=attributes)
    elif isinstance(X, np.ndarray):
        return pd.DataFrame(X, index=indices, columns=attributes)
    else:
        raise TypeError("Incorrect data structure passed")
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, no_difference=True):  # no *args or **kwargs
        self.no_difference = no_difference
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        X = data_to_frame(X)  # make the pandas operations below possible
        attributes.extend(['sequence_length', 'difference', 'phLabel'])
        sequence_length = X['sequence'].str.len()
        difference = X['residueCount'] - sequence_length
        phLabel = X['phValue'].apply(ph_labels)
        if self.no_difference:
            attributes.append('no_difference')
            no_difference = (difference == 0)
            return np.c_[X, sequence_length, difference, phLabel, no_difference]
        return np.c_[X, sequence_length, difference, phLabel]
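For manual preprocessing I can call it directly on the raw data, roughly like this (df stands for the DataFrame loaded from the Kaggle CSV):

adder = CombinedAttributesAdder(no_difference=True)
X_extended = adder.transform(df)  # returns a plain numpy array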
Pandas Operations in Transformers.
The operations I want to perform in the transformers are specific to pandas. My workaround was to convert the input numpy array to a DataFrame inside the transform function and return the result as a numpy array. I use global variables for the attribute names and indices. I realize that this is a lackluster approach. How could I use pandas operations in my custom transformers?
I came across this blog post, however I was not able to make it work with ColumnTransformer: https://zablo.net/blog/post/pandas-dataframe-in-scikit-learn-feature-union/
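What I am after is roughly a transformer that takes a DataFrame in and returns a DataFrame out, so the pandas operations are natural and no globals are needed. A minimal sketch (the class name is just illustrative, and it assumes it is fed a DataFrame, e.g. as a step in a plain Pipeline rather than inside a ColumnTransformer):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # X is expected to be a DataFrame here, so column access just works
        X = X.copy()
        X['sequence_length'] = X['sequence'].str.len()
        X['difference'] = X['residueCount'] - X['sequence_length']
        X['phLabel'] = X['phValue'].apply(ph_labels)
        return X  # returning a DataFrame keeps column names for later steps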
Update:
There are other issues with my pipeline. How do subsequent transformers work when specifying the columns to transform? Does ColumnTransformer pass the whole set to each transformer, operate on the specified columns, and return the modified full set to the next transformer? Also, not specifying columns for my custom transformers seems to raise an error, even though the columns are not functional in my case, as I pass the arguments to the constructor. How should I alter my code?
If I comment out OrdinalEncoder and OneHotEncoder, then after fit_transform the ColumnTransformer outputs a numpy array with the shape (rows, 72). There are 19 attributes and I drop 2 of them in the FeatureSelector transformer, so without OHE I would expect an array of (rows, 17).
If I leave the pipeline as is, I receive: ValueError: Input contains NaN.
attributes is a global list of every column in my data set; in FeatureSelector I remove the values that I drop.
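Concretely, it is defined along these lines (df again stands for the loaded dataset):

attributes = list(df.columns)  # all 19 column names of the raw data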
# numeric_feat_eng + categ_feat_eng contains all of my attributes
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

prepoc_pipeline = make_column_transformer(
    (SimpleImputer(strategy='mean'), numeric_feat_eng),
    (SimpleImputer(strategy='most_frequent'), categ_feat_eng),
    (FixAtributeValues(), attributes),
    (CombinedAttributesAdder(), attributes),
    (FeatureSelector(attributes_to_drop), attributes_to_drop),
    (LogTransformation(atr_log_trans), atr_log_trans),
    (StandardScaler(), numeric_feat_eng),
    (OrdinalEncoder(), id_cols),
    (OneHotEncoder(handle_unknown='ignore'), categ_without_ids)
)
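I then run it with:

X_prepared = prepoc_pipeline.fit_transform(df)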
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attributes_drop=['pdbxDetails', 'sequence']):
        self.attributes_drop = attributes_drop
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        X = data_to_frame(X)
        # keep the global column list in sync with the dropped columns
        for x in self.attributes_drop:
            attributes.remove(x)
        X = X.drop(columns=self.attributes_drop)
        return X
If anyone could guide me on how to do this, it would be very much appreciated! Or point me to sources where I could learn how to create such pipelines.