
I want to impute the mean of a feature, but only calculate that mean from other examples that have the same category/nominal value in another column. Is this possible using scikit-learn's Imputer class? It would make it easier to add into a pipeline that way.

For example:

Using the Titanic dataset from Kaggle: source

How would I go about imputing the mean fare per pclass? The reasoning is that people in different classes would have large differences in ticket cost.

Update: After discussion with some people, the phrase I should have used was "imputing the mean within class".

I've looked into Vivek's comment below and will construct a generic pipeline function to do what I want when I get time :) I have a good idea of how to do it and will post it as an answer once it's finished.
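For reference, outside a pipeline the "mean within class" computation itself is a one-liner with pandas. A minimal sketch, using made-up fares rather than the real Titanic data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic columns (made-up values)
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "fare":   [80.0, np.nan, 20.0, 30.0, np.nan, 8.0],
})

# Replace each missing fare with the mean fare of its pclass group
df["fare"] = df.groupby("pclass")["fare"].transform(lambda s: s.fillna(s.mean()))
```

`groupby(...).transform` keeps the original row order, and `s.mean()` skips NaN values by default, so each group's mean is computed only from the complete examples.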

  • You can split the data per `pclass`, impute the `fare` for them, and then stack them again to make complete data. – Vivek Kumar Mar 11 '17 at 02:25
  • Thanks @VivekKumar! I'll look into doing that as part of my pipeline – TheJokersThief Mar 13 '17 at 16:09
  • You can look at [this example](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py) to get hints for implementing your own class for doing it, which can be used in pipelines – Vivek Kumar Mar 13 '17 at 19:39
  • See also https://datascience.stackexchange.com/q/71856/55122 – Ben Reiniger May 13 '21 at 14:35

1 Answer


Below is a pretty simple approach to my question that only handles means. A more robust implementation would probably build on scikit-learn's Imputer class, so that it could also do mode, median, etc. and would be better at dealing with sparse/dense matrices.

This is based on Vivek Kumar's comment on the original question, which suggested splitting the data per class, imputing each split, and then re-assembling the splits into the complete dataset.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        self.y = y
        return self

    def transform(self, X):
        X = np.asarray(X)
        stacks = []

        # `value == np.nan` is always False, so NaN needs an explicit check
        col = X[:, self.replace_col_index].astype(float)
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            missing_mask = np.isnan(col)
        else:
            missing_mask = X[:, self.replace_col_index] == self.missing_values

        if self.class_col_index is None:
            # We're using the dependent variable, so it must match X in length
            if self.y is None or len(self.y) != len(X):
                return X
            labels = np.asarray(self.y)
        else:
            # We're using nominal values within a feature (i.e. the classes
            # are unique values within a nominal column - e.g. sex)
            labels = X[:, self.class_col_index]

        for aclass in np.unique(labels):
            class_mask = labels == aclass
            with_missing = X[class_mask & missing_mask]
            without_missing = X[class_mask & ~missing_mask]

            # Calculate the mean from examples without missing values
            mean = np.mean(without_missing[:, self.replace_col_index].astype(float))

            # Broadcast the mean to all missing values in this class
            with_missing[:, self.replace_col_index] = mean

            stacks.append(np.concatenate((with_missing, without_missing)))

        if stacks:
            # Reassemble our stacks of values (note: row order changes)
            X = np.concatenate(stacks)

        return X
  • You should probably learn the relevant statistics at `fit` time, so that test data doesn't impute using its own distribution. – Ben Reiniger May 13 '21 at 14:30
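As the comment above suggests, a more pipeline-safe variant learns the per-class means at fit time and only applies them in transform, so test data is imputed from the training distribution. A minimal sketch of that idea (the class name and the toy numbers here are hypothetical, not from the original answer):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class FitTimeWithinClassMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: per-class means are learned in fit()."""

    def __init__(self, replace_col_index, class_col_index):
        self.replace_col_index = replace_col_index
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = {}
        for aclass in np.unique(X[:, self.class_col_index]):
            rows = X[X[:, self.class_col_index] == aclass]
            # np.nanmean ignores the missing entries when averaging
            self.means_[aclass] = np.nanmean(rows[:, self.replace_col_index])
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy so the caller's array is untouched
        missing = np.isnan(X[:, self.replace_col_index])
        for i in np.where(missing)[0]:
            X[i, self.replace_col_index] = self.means_.get(
                X[i, self.class_col_index], np.nan)
        return X

train = np.array([[1, 80.0], [1, 60.0], [2, np.nan], [2, 10.0]])
test = np.array([[1, np.nan], [2, np.nan]])

imp = FitTimeWithinClassMeanImputer(replace_col_index=1, class_col_index=0)
out = imp.fit(train).transform(test)
# Test rows get the training means: 70.0 for class 1, 10.0 for class 2
```

Because the means live in `means_` after `fit`, this also preserves row order and works inside `cross_val_score`/`Pipeline` without leaking test-set statistics.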