One-hot-encoding with missing categories

Question

I have a dataset with a category column. In order to use linear regression, I 1-hot encode this column.

My set has 10 columns, including the category column. After dropping that column and appending the 1-hot encoded matrix, I end up with 14 columns (10 - 1 + 5).

So I train (fit) my LinearRegression model with a matrix of shape (n, 14).

After training, I want to test the model on a subset of the training set, so I take only the first 5 rows and put them through the same pipeline. But these 5 rows contain only 3 of the categories, so after going through the pipeline I'm left with a matrix of shape (n, 12), because 2 categories are missing.

How can I force the 1-hot encoder to use all 5 categories?

I'm using LabelBinarizer from sklearn.

You should have only used LabelBinarizer.transform() on the new data. Never fit(). Show the code and we will modify it to suit your needs. — Vivek Kumar, Feb 21 '18 at 05:58
That was it @VivekKumar - once the transformer (in this case LabelBinarizer) has been fitted, it should not be re-fitted. Thanks! — lipsumar, Feb 21 '18 at 14:06

score 7 · Accepted Answer · answered Feb 21 '18 at 14:18

The error was to "put the test data through the same pipeline". Basically I was doing:

data_prepared = full_pipeline.fit_transform(train_set)

lin_reg = LinearRegression()
lin_reg.fit(data_prepared, labels)

some_data = train_set.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)

lin_reg.predict(some_data_prepared)
# => error because mismatching shapes

The problematic line is:

some_data_prepared = full_pipeline.fit_transform(some_data)

By calling fit_transform, I refit the LabelBinarizer on a set containing only 3 labels, so it produces only 3 columns. Instead I should do:

some_data_prepared = full_pipeline.transform(some_data)

This way I use the pipeline that was fitted on the full set (train_set), so the new data is transformed in exactly the same way.
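The difference can be reproduced in isolation. The following is a minimal sketch that uses LabelBinarizer directly, with made-up category labels rather than the original dataset or full_pipeline:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category column with 5 distinct labels.
train_categories = np.array(["a", "b", "c", "d", "e", "a", "c"])
subset = train_categories[:3]  # contains only "a", "b", "c"

encoder = LabelBinarizer()
train_encoded = encoder.fit_transform(train_categories)  # learns all 5 classes
subset_encoded = encoder.transform(subset)               # reuses the 5 learned classes

# Refitting on the subset forgets the two classes that do not appear in it.
refit_encoded = LabelBinarizer().fit_transform(subset)

print(train_encoded.shape, subset_encoded.shape, refit_encoded.shape)
```

Only the refitted encoder produces fewer columns, which is what caused the shape mismatch in predict.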

Thanks @Vivek Kumar

joaoavf · Answer 2 · 2018-02-20T19:43:50.430

I have run into this issue, and I couldn't find a solution through scikit-learn.

I am using pandas .get_dummies() to do something similar to OneHotEncoder.

Below is a function I made to deal with this exact issue. Feel free to use and improve it, and please let me know if you find any errors (I adapted it from a more specific function in my codebase):

import numpy as np
import pandas as pd

def one_hot_encoding_fixed_columns(pandas_series, fixed_columns):

    # Creates complete fixed columns list (with nan and 'other')
    fixed_columns = list(fixed_columns)
    fixed_columns.extend([np.nan, 'other'])

    # Get dummies dataset (dtype=int keeps 0/1 integers; newer pandas returns booleans by default)
    ohe_df = pd.get_dummies(pandas_series, dummy_na=True, dtype=int)

    # Create blank 'other' column
    ohe_df['other'] = 0

    # Check if columns created by get_dummies() are in 'fixed_columns' list.
    for column in ohe_df.columns:

        if column not in fixed_columns:
            # If not in 'fixed_columns', transforms exceeding column into 'other'.
            ohe_df['other'] = ohe_df['other'] + ohe_df[column]
            # drop() returns a new DataFrame, so the result must be assigned back.
            ohe_df = ohe_df.drop(columns=[column])

    # Check if elements in 'fixed_columns' are in the df generated by get_dummies()
    for column in fixed_columns:

        if column not in ohe_df.columns:
            # If the element is not present, create a new column with all values set to 0.
            ohe_df[column] = 0

    # Reorders columns according to fixed columns
    ohe_df = ohe_df[fixed_columns]

    return ohe_df

Basically, you create a list of columns that will always be used. If the test sample does not have any elements of a given category, a corresponding column filled with zeros is created. If the test sample has a new value that wasn't in the training sample, it is categorized as 'other'.

I have commented the code and I hope it is understandable. If you have any questions, just let me know and I'll clarify.

The input of this function is pandas_series = df['column_name'], and you could do something like fixed_columns = df[selected_column].str[0].value_counts().index.values on the training set to generate the values that will also be used on the test set.
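For comparison, the same idea can be sketched more compactly with reindex. This is an alternative to the function above, not a copy of it: the name dummies_with_fixed_columns and the sample values are hypothetical, and missing values are encoded here as an all-zero row instead of a separate NaN column.

```python
import pandas as pd

def dummies_with_fixed_columns(series, fixed_columns):
    """One-hot encode `series` into exactly `fixed_columns` plus an 'other' column."""
    fixed = list(fixed_columns)
    # Relabel values outside the fixed list as 'other'; leave missing values alone.
    cleaned = series.where(series.isin(fixed) | series.isna(), "other")
    dummies = pd.get_dummies(cleaned, dtype=int)
    # reindex adds absent columns filled with 0 and enforces the column order.
    return dummies.reindex(columns=fixed + ["other"], fill_value=0)

# Hypothetical test sample: "d" was never seen in training, None is missing.
test_series = pd.Series(["a", "d", None])
encoded = dummies_with_fixed_columns(test_series, ["a", "b", "c"])
print(encoded)
```

Whatever the test sample contains, the result always has the columns a, b, c and other, in that order.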

score -1 · Answer 3 · answered Nov 04 '20 at 10:27

Basically, first apply fit_transform to the base data, then apply only transform to the sample data. That way the sample data gets exactly the same number of columns as the base data.
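As an illustration of that pattern, here is a small sketch with scikit-learn's OneHotEncoder. The encoder choice, the category names and the explicit categories list are assumptions made for the example, not part of the question's pipeline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category column; OneHotEncoder expects 2-D input.
base = np.array([["a"], ["b"], ["c"], ["d"], ["e"]], dtype=object)
sample = np.array([["a"], ["c"], ["z"]], dtype=object)  # "z" never appears in base

# Listing the categories explicitly pins the output columns, and
# handle_unknown="ignore" encodes unseen values as an all-zero row.
encoder = OneHotEncoder(
    categories=[["a", "b", "c", "d", "e"]],
    handle_unknown="ignore",
)

base_encoded = encoder.fit_transform(base).toarray()   # fit once, on the base data
sample_encoded = encoder.transform(sample).toarray()   # transform only, no refit

print(base_encoded.shape, sample_encoded.shape)
```

Both matrices have 5 encoded columns, even though the sample contains only 2 of the known categories.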

Rajendra Batchu
