I have some scraped data that needs cleaning. After the cleaning, I want to create numerical and categorical pipelines inside a ColumnTransformer, such as:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(exclude='object').columns

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])

My idea was to create a custom transformer, class Transformer(BaseEstimator, TransformerMixin):, and build a pipeline with it. That transformer would include all the cleaning steps. My problem is that some of those steps change a column's dtype (mostly from object to integer), so I'm thinking that instead of defining categorical_cols and numerical_cols by dtype, I should define them by column name.

Would that be the correct approach? The idea would be to automate the process so I can train the model with new data every time.
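
For reference, a minimal sketch of the kind of cleaning transformer I have in mind (the 'price' column and the to_numeric step are just placeholders for my real cleaning logic):

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class Cleaner(BaseEstimator, TransformerMixin):
    # Stateless cleaner: nothing to learn, so fit is a no-op
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        # Placeholder cleaning step; 'price' is a hypothetical column.
        # This is the kind of step that changes a dtype from object to numeric:
        X['price'] = pd.to_numeric(X['price'], errors='coerce')
        return X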

Odiseon

2 Answers

Instead of making a list of columns beforehand, you can use scikit-learn's make_column_selector to dynamically specify the columns that each transformer is applied to.

In your example:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, selector(dtype_exclude=object)),
    ('cat_pipeline', cat_pipeline, selector(dtype_include=object))
])

Under the hood it uses pandas' select_dtypes for the type selection. You can also pass a regex via the pattern parameter to select columns by name, as in the sketch below.
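
For example, a sketch that selects columns by a name pattern instead of a dtype (the num_/cat_ prefixes are made up for illustration):

# Select columns whose names match a regex rather than a dtype
# (assumes a hypothetical 'num_'/'cat_' column naming scheme):
preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, selector(pattern='^num_')),
    ('cat_pipeline', cat_pipeline, selector(pattern='^cat_'))
])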

I also recommend checking out make_column_transformer, a shorthand that builds the same ColumnTransformer without naming each step yourself.
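
A sketch of the equivalent preprocessor written that way:

from sklearn.compose import make_column_transformer, make_column_selector as selector

# Same preprocessor as above; the step names ('pipeline-1', 'pipeline-2')
# are generated automatically from the estimator class names:
preprocessor = make_column_transformer(
    (num_pipeline, selector(dtype_exclude=object)),
    (cat_pipeline, selector(dtype_include=object))
)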

CarlosGDCJ

The process is OK. As you said, the types change, and on many occasions you encode the data in order to use it. To prevent this from causing trouble, label the columns as categorical or numerical up front and then change their types as you wish, for example with LabelEncoder. In many situations a missing value turns an integer column into an object column, which makes you miserable when reporting results. So forget about total automation in this field: instead, find each column's dtype once, save that mapping, and then give the data to the pipeline.

# Define numerical and categorical columns by name
numerical_cols = ['numerical_col_1', 'numerical_col_2', ...]
categorical_cols = ['categorical_col_1', 'categorical_col_2', ...]

num_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

cat_pipeline = Pipeline(
    steps=[
        ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, numerical_cols),
    ('cat_pipeline', cat_pipeline, categorical_cols)
])

With this modification, you can update the numerical_cols and categorical_cols lists whenever you have new data with different columns, and the pipeline will adapt accordingly.

You can always use a method like the following to find each column's dtype:

# Collect the columns that cannot be cast to int
non_integer_columns = []
new_data = data.dropna()
for col in data.columns:
    try:
        new_data[col] = new_data[col].astype(int)
    except (ValueError, TypeError):
        non_integer_columns.append(col)
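
Building on that, a sketch of saving each column's dtype once and re-applying it to new data before it enters the pipeline (dtypes.json is just an example file name):

import json

# Record each column's dtype once and save the mapping
dtype_map = {col: str(dtype) for col, dtype in data.dtypes.items()}
with open('dtypes.json', 'w') as f:
    json.dump(dtype_map, f)

# Later: cast new data to the saved dtypes before fitting the pipeline.
# Note: columns containing NaN cannot be cast back to a plain int dtype,
# so handle missing values first.
with open('dtypes.json') as f:
    dtype_map = json.load(f)
new_data = new_data.astype(dtype_map)
preprocessed = preprocessor.fit_transform(new_data)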
Sadegh