I am trying to include a transformer in a scikit-learn pipeline that will bin a continuous data column into 4 values based on my own supplied cut points. The current arguments to KBinsDiscretizer do not work, mainly because the strategy argument only accepts {'uniform', 'quantile', 'kmeans'}.
pandas already has the cut() function, so I guess I will need to create a custom transformer that wraps the cut() behavior.
Desired Behavior (not actual)
X = [[-2, -1, -0.5, 0, 0.5, 1, 2]]
est = Discretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")],
encode='ordinal')
est.fit(X)
est.transform(X)
# >>> array([[0., 0., 1., 1., 2., 2., 3.]])
The result above assumes that each bin includes its rightmost edge and that the first bin includes the lowest value, which is what this pd.cut() call provides:
import pandas as pd
import numpy as np
pd.cut(np.array([-2, -1, -0.5, 0, 0.5, 1, 2]),
[-float("inf"), -1.0, 0.0, 1.0, float("inf")],
labels=False, right=True, include_lowest=True)
# >>> array([0, 0, 1, 1, 2, 2, 3])
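For reference, here is a minimal sketch of the kind of wrapper I have in mind. The class name CutDiscretizer and its parameters are my own invention, not an existing scikit-learn API. It assumes the usual scikit-learn layout (rows are samples, columns are features), so the data is passed as a single column rather than the single row shown above, and it only covers the ordinal encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CutDiscretizer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: bins every column with pd.cut and fixed edges."""

    def __init__(self, bins, right=True, include_lowest=True):
        self.bins = bins
        self.right = right
        self.include_lowest = include_lowest

    def fit(self, X, y=None):
        # Nothing is learned; the cut points are supplied by the user.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if X.ndim != 2:
            raise ValueError("Expected a 2-D array of shape (n_samples, n_features).")
        # pd.cut only accepts 1-D input, so bin each column separately.
        columns = [
            pd.cut(
                X[:, j],
                bins=self.bins,
                labels=False,
                right=self.right,
                include_lowest=self.include_lowest,
            )
            for j in range(X.shape[1])
        ]
        # Return float codes, mirroring the ordinal output of KBinsDiscretizer.
        return np.column_stack(columns).astype(float)


# One feature, seven samples (a column, not a row).
X = np.array([[-2, -1, -0.5, 0, 0.5, 1, 2]]).T
est = CutDiscretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")])
codes = est.fit_transform(X)
# codes.ravel() -> array([0., 0., 1., 1., 2., 2., 3.])
```

Since fit() learns nothing, the same object should be usable as a pipeline step, but I have not checked how it behaves with values outside the supplied edges (pd.cut returns NaN for those).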