scikit-learn transformer that bins data based on user supplied cut points

Question

I am trying to include a transformer in a scikit-learn pipeline that will bin a continuous data column into 4 values based on my own supplied cut points. The current arguments to KBinsDiscretizer do not work mainly because the strategy argument only accepts {‘uniform’, ‘quantile’, ‘kmeans’}.

There is already the cut() function in pandas so I guess that I will need to create a custom transformer that wraps the cut() function behavior.

Desired Behavior (not actual)

X = [[-2, -1, -0.5, 0, 0.5, 1, 2]]
est = Discretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
                  encode='ordinal')
est.fit(X)  
est.transform(X)
# >>> array([[0., 0., 1., 1., 2., 2., 3.]])

The result above assumes that each bin includes its right edge and that the lowest edge is included, as this pd.cut() call would produce:

import pandas as pd
import numpy as np
pd.cut(np.array([-2, -1, -0.5, 0, 0.5, 1, 2]),
       [-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
       labels=False, right=True, include_lowest=True)
# >>> array([0, 0, 1, 1, 2, 2, 3])

Steven M. Mortimer · Accepted Answer · 2019-08-30T01:38:49.003

This custom transformer seems to work for me. scikit-learn expects numeric arrays, so I'm not sure the pd.cut() feature that returns labels can be supported. For this reason I've hard-coded labels to False in the implementation below.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CutTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, bins, right=True, precision=3,
                 include_lowest=False, duplicates='raise'):
        # Store constructor arguments unchanged, as scikit-learn expects.
        self.bins = bins
        self.right = right
        self.precision = precision
        self.include_lowest = include_lowest
        self.duplicates = duplicates

    def fit(self, X, y=None):
        # Stateless: the cut points are supplied by the user, not learned.
        return self

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        X = X.copy()  # do not modify the caller's DataFrame in place
        for col in X.columns:
            # labels=False returns integer bin codes; retbins=False keeps the
            # return value a plain array rather than a (codes, bins) tuple.
            X[col] = pd.cut(x=X[col].values, labels=False, retbins=False,
                            **self.get_params())
        return X

An Example

df = pd.DataFrame(data={'rand': np.random.rand(5)})
df
    rand
0   0.030653
1   0.542533
2   0.159646
3   0.963112
4   0.539530

ct = CutTransformer(bins=np.linspace(0, 1, 5))
ct.transform(df)
    rand
0   0
1   2
2   0
3   3
4   2
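
The question asks to bin a single continuous column, whereas the transformer above bins every column it receives. One way to restrict it is scikit-learn's ColumnTransformer. The sketch below is self-contained, so it redefines a trimmed-down transformer; the class name ColumnCutter and the column names score and other are made up for illustration. It applies the question's cut points to one column and passes the other through.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class ColumnCutter(TransformerMixin, BaseEstimator):
    """Hypothetical, trimmed-down wrapper around pd.cut (illustration only)."""

    def __init__(self, bins, right=True, include_lowest=False):
        self.bins = bins
        self.right = right
        self.include_lowest = include_lowest

    def fit(self, X, y=None):
        return self  # nothing to learn: the cut points come from the user

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            X[col] = pd.cut(X[col].to_numpy(), bins=self.bins, right=self.right,
                            include_lowest=self.include_lowest, labels=False)
        return X.to_numpy()


df = pd.DataFrame({
    "score": [-2, -1, -0.5, 0, 0.5, 1, 2],   # column to bin
    "other": [10, 20, 30, 40, 50, 60, 70],   # column left untouched
})

binner = ColumnTransformer(
    [("cut", ColumnCutter(bins=[-np.inf, -1.0, 0.0, 1.0, np.inf]), ["score"])],
    remainder="passthrough",
)
out = binner.fit_transform(df)
print(out[:, 0])  # bin codes for "score"
print(out[:, 1])  # "other", passed through unchanged
```

Note the orientation: scikit-learn treats columns as features, so the seven values from the question are laid out as seven rows of one column rather than as a single row.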

score 0 · Answer 2 · answered Aug 30 '19 at 01:32

An alternative to a custom transformer, which has more overhead, is the FunctionTransformer class. It is a good fit for stateless operations like this one, where the bins are predefined.

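import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline

def ftransformer_cut(X, **kwargs):
    # Default to integer bin codes; scikit-learn expects numeric output.
    kwargs.setdefault('labels', False)
    assert kwargs['labels'] is False

    # Accept a DataFrame or an array and work on a float copy,
    # so the caller's data is not modified in place.
    X = np.array(X, dtype=float)

    for jj in range(X.shape[1]):
        X[:, jj] = pd.cut(x=X[:, jj], **kwargs)

    return X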

pipeline = make_pipeline(
    FunctionTransformer(ftransformer_cut,
                        kw_args={'bins': np.linspace(0, 1, 5)})
)

df = pd.DataFrame(data={'rand': np.random.rand(5)})
    rand
0   0.823234
1   0.336883
2   0.713595
3   0.408184
4   0.038

pipeline.fit_transform(df)
array([[3.],
       [1.],
       [2.],
       [1.],
       [0.]])
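
If calling pandas inside the transform function is unwanted, the same right-closed binning can be sketched with np.digitize. This is an alternative under stated assumptions, not what the answer above does: it assumes the outermost edges are infinite (as in the question), so only the interior cut points are passed, and the function name digitize_columns and its cuts argument are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def digitize_columns(X, cuts):
    """Bin every column into right-closed intervals given interior cut points."""
    X = np.asarray(X, dtype=float)
    # right=True puts a value equal to a cut point into the lower bin,
    # mirroring pd.cut(..., right=True) with -inf/+inf as the outer edges.
    return np.digitize(X, bins=cuts, right=True).astype(float)


binner = FunctionTransformer(digitize_columns, kw_args={"cuts": [-1.0, 0.0, 1.0]})

X = np.array([[-2, -1, -0.5, 0, 0.5, 1, 2]]).T  # one feature, seven samples
codes = binner.fit_transform(X)
print(codes.ravel())
```

Missing values behave differently here: pd.cut returns NaN for a NaN input, while np.digitize assigns it an integer bin, so NaNs would need separate handling.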

score 0 · Answer 3 · answered Apr 28 '21 at 18:14

The only issue there is that you are only transforming the incoming data: nothing about the bins is learned from the training data during the fit stage and then reused during the transform stage. Ideally, the bin edges should be established during fit and the bins assigned during transform.
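
With user-supplied cut points there is nothing to estimate from the data, but a transformer can still follow the fit/transform split by validating the edges in fit and storing them in a fitted attribute that transform then uses. A minimal sketch, where the class name FixedBinsDiscretizer and the attribute bin_edges_ are illustrative choices rather than scikit-learn API:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class FixedBinsDiscretizer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: validates edges in fit, bins in transform."""

    def __init__(self, bins, right=True, include_lowest=False):
        self.bins = bins
        self.right = right
        self.include_lowest = include_lowest

    def fit(self, X, y=None):
        edges = np.asarray(self.bins, dtype=float)
        if edges.ndim != 1 or edges.size < 2 or np.any(np.diff(edges) <= 0):
            raise ValueError("bins must be a 1-D, strictly increasing sequence")
        self.bin_edges_ = edges                       # fitted state for transform
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        check_is_fitted(self, "bin_edges_")           # raises if fit was skipped
        X = np.asarray(X, dtype=float)
        out = np.empty_like(X)
        for jj in range(X.shape[1]):
            out[:, jj] = pd.cut(X[:, jj], bins=self.bin_edges_, right=self.right,
                                include_lowest=self.include_lowest, labels=False)
        return out


X = np.array([[-2, -1, -0.5, 0, 0.5, 1, 2]]).T  # one feature, seven samples
est = FixedBinsDiscretizer(bins=[-np.inf, -1.0, 0.0, 1.0, np.inf])
result = est.fit_transform(X)
print(result.ravel())
```

The same structure would also accommodate edges that really are learned, for example quantiles computed in fit, without changing transform.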
