I have some scraped data that needs cleaning. After the cleaning, I want to build numerical and categorical pipelines inside a ColumnTransformer, such as:
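from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split the columns by dtype: object columns are treated as categorical
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])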
My idea was to create a custom transformer,
class Transformer(BaseEstimator, TransformerMixin):
and build a pipeline with it. That transformer would include all the cleaning steps. My problem is that some of the steps change a column's dtype, mostly from object to integer, so instead of defining categorical_cols and numerical_cols by dtype, I'm thinking of defining them by column name.
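To make the idea concrete, here is a minimal sketch of what I have in mind. The column names (price, rooms, city), the Cleaner class, and its cleaning rules are all made up for illustration; the real data and steps differ. The point is only that the column lists are fixed by name, so it does not matter that the cleaning step changes dtypes before the ColumnTransformer runs.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, fixed up front instead of inferred from dtypes.
NUMERICAL_COLS = ["price", "rooms"]
CATEGORICAL_COLS = ["city"]


class Cleaner(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning step: turns scraped strings into usable values."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = X.copy()
        # "1,200 EUR" -> 1200: the dtype changes from string to int here
        X["price"] = X["price"].str.replace(r"[^\d]", "", regex=True).astype(int)
        X["rooms"] = X["rooms"].astype(int)
        # Normalise category labels so " Madrid" and "madrid " match
        X["city"] = X["city"].str.strip().str.lower()
        return X


preprocessor = ColumnTransformer([
    ("num_pipeline", Pipeline([("scaler", StandardScaler())]), NUMERICAL_COLS),
    ("cat_pipeline",
     Pipeline([("onehotencoder", OneHotEncoder(handle_unknown="ignore"))]),
     CATEGORICAL_COLS),
])

# Cleaning runs first, so the ColumnTransformer only ever sees cleaned columns.
full_pipeline = Pipeline([
    ("cleaner", Cleaner()),
    ("preprocessor", preprocessor),
])

# Made-up sample rows standing in for the scraped data
raw = pd.DataFrame({
    "price": ["1,200 EUR", "950 EUR", "2,000 EUR"],
    "rooms": ["3", "2", "4"],
    "city": [" Madrid", "madrid ", "Sevilla"],
})
features = full_pipeline.fit_transform(raw)
```

With this layout, retraining on a new scrape would just be another call to full_pipeline.fit on the raw frame, as long as the column names stay the same.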
Would that be the correct approach? The goal is to automate the process so I can retrain the model each time new data comes in.