5

I have a dataset with a category column. In order to use linear regression, I 1-hot encode this column.

My set has 10 columns, including the category column. After dropping that column and appending the 1-hot encoded matrix, I end up with 14 columns (10 - 1 + 5).

So I train (fit) my LinearRegression model with a matrix of shape (n, 14).

After training it, I want to test it on a subset of the training set, so I take only the first 5 rows and put them through the same pipeline. But these 5 rows only contain 3 of the categories. So after going through the pipeline, I'm left with a matrix of shape (n, 12) (10 - 1 + 3) because it's missing 2 categories.

How can I force the 1-hot encoder to use the 5 categories?

I'm using LabelBinarizer from sklearn.

lipsumar
  • 1
    You should only have used LabelBinarizer.transform() on the new data. Never fit(). Show the code and we will modify it to suit your needs. – Vivek Kumar Feb 21 '18 at 05:58
  • That was it @VivekKumar - once the transformer (in this case LabelBinarizer) has been fitted, it should not be re-fitted. Thanks! – lipsumar Feb 21 '18 at 14:06

3 Answers

7

The error is to "put the test data through the same pipeline". Basically, I was doing:

data_prepared = full_pipeline.fit_transform(train_set)

lin_reg = LinearRegression()
lin_reg.fit(data_prepared, labels)

some_data = train_set.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)

lin_reg.predict(some_data_prepared)
# => error because mismatching shapes

The problematic line is:

some_data_prepared = full_pipeline.fit_transform(some_data)

Calling fit_transform re-fits the LabelBinarizer on a set containing only 3 labels, so it only produces 3 columns. Instead I should do:

some_data_prepared = full_pipeline.transform(some_data)

This way I'm using the pipeline fitted on the full set (train_set), so the subset is transformed in the same way.
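To see the difference in isolation, here is a minimal sketch using LabelBinarizer directly. The category values and variable names are made up for illustration and are not from my actual dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category column with 5 distinct values.
train_categories = np.array(["a", "b", "c", "d", "e", "a", "c", "e"])

# Fit once on the full training column: the encoder learns all 5 classes.
encoder = LabelBinarizer()
train_encoded = encoder.fit_transform(train_categories)

# A subset that only contains 3 of the 5 categories.
subset = train_categories[[0, 1, 2, 5, 6]]

# transform() reuses the classes learned above, so 5 columns come out.
subset_encoded = encoder.transform(subset)

# Re-fitting on the subset only learns 3 classes, hence the shape mismatch.
refit_encoded = LabelBinarizer().fit_transform(subset)

print(train_encoded.shape, subset_encoded.shape, refit_encoded.shape)
```

The already-fitted encoder keeps the column count stable, while the re-fitted one shrinks to the categories it happens to see.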

Thanks @Vivek Kumar

lipsumar
1

I have run into this issue, and I couldn't find a solution through scikit-learn.

I am using pandas .get_dummies() to do something similar to OneHotEncoder.

Below follows a function I made to deal with this exact issue, feel free to use it and improve it (and please let me know if you find any errors, I actually just made it from a more specific function I had in my codebase):

import numpy as np
import pandas as pd

def one_hot_encoding_fixed_columns(pandas_series, fixed_columns):

    # Creates complete fixed columns list (with nan and 'other')
    fixed_columns = list(fixed_columns)
    fixed_columns.extend([np.nan, 'other'])

    # Get dummies dataset
    ohe_df = pd.get_dummies(pandas_series, dummy_na=True)

    # Create blank 'other' column
    ohe_df['other'] = 0

    # Check if columns created by get_dummies() are in 'fixed_columns' list.
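    # Iterate over a copy of the labels, since columns are dropped inside the loop.
    for column in list(ohe_df.columns):

        # NaN never compares equal to itself, so it is skipped explicitly here.
        if not pd.isna(column) and column not in fixed_columns:
            # If not in 'fixed_columns', folds the extra column into 'other'.
            ohe_df['other'] = ohe_df['other'] + ohe_df[column]
            # drop() returns a new DataFrame, so the result must be reassigned.
            ohe_df = ohe_df.drop(columns=[column])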

    # Check if elements in 'fixed_columns' are in the df generated by get_dummies()
    for column in fixed_columns:

        if column not in ohe_df.columns:
            # If the element is not present, create a new column with all values set to 0.
            ohe_df[column] = 0

    # Reorders columns according to fixed columns
    ohe_df = ohe_df[fixed_columns]

    return ohe_df

Basically, you create a list of the columns that will always be used. If the test sample doesn't have any elements of a given category, a corresponding column with all values = 0 is created. If the test sample has a new value that wasn't in the train sample, it is categorized as 'other'.

I have commented the code and I hope it is understandable. If you have any questions, just let me know and I'll clarify.

The input of this function is pandas_series = df['column_name'], and you could do something like fixed_columns = df[selected_column].str[0].value_counts().index.values on the training set to generate the values that will also be used on the test set.
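For comparison, a shorter sketch of the same fixed-columns idea using DataFrame.reindex. This is a different technique from the function above: it simply drops unseen values instead of collecting them in an 'other' column, and it ignores NaN. The series contents and variable names are hypothetical:

```python
import pandas as pd

# Hypothetical training and test columns; "z" never appears in training.
train_col = pd.Series(["a", "b", "c", "d", "e"])
test_col = pd.Series(["a", "c", "z"])

# The fixed column list is derived from the training data only.
fixed_columns = sorted(train_col.unique())

test_dummies = pd.get_dummies(test_col)

# reindex() adds zero-filled columns for categories missing from the test
# data and discards columns for values that were not seen in training.
test_encoded = test_dummies.reindex(columns=fixed_columns, fill_value=0).astype(int)

print(test_encoded)
```

The row holding "z" ends up all zeros here, which may or may not be acceptable depending on the model.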

joaoavf
-1

Basically, first apply fit_transform to the base data and then apply transform to the sample data, so that the sample data gets exactly the same number of columns as the base data.
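A rough end-to-end sketch of that order of operations. Since the question does not show how full_pipeline is built, this assumes a ColumnTransformer wrapping a OneHotEncoder rather than LabelBinarizer, and the data and column names are invented:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical base data: one numeric column and one category column.
train_set = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "category": ["a", "b", "c", "a", "b", "c", "d", "e"],
})
labels = np.array([1.5, 2.0, 3.5, 4.0, 5.5, 6.0, 7.5, 8.0])

# Assumed stand-in for the pipeline: one-hot encode "category", keep the rest.
full_pipeline = ColumnTransformer(
    [("cat", OneHotEncoder(), ["category"])],
    remainder="passthrough",
)

# fit_transform on the base data only.
data_prepared = full_pipeline.fit_transform(train_set)
lin_reg = LinearRegression().fit(data_prepared, labels)

# transform (no fit) on the sample, which only contains categories a, b, c.
some_data = train_set.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
predictions = lin_reg.predict(some_data_prepared)

print(data_prepared.shape, some_data_prepared.shape, predictions.shape)
```

Both matrices have the same number of columns, so predict() runs without a shape error.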