I am looking to use scikit-learn's KMeans to cluster a dataset into K bins and then use OneHotEncoder to binarize the resulting cluster labels. I'd like to use this in a Pipeline, but I think I'm running into problems because KMeans returns the labels via its fit_predict() method, not via fit_transform().
Here is some sample code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

foo = np.random.randn(100, 50)
km = KMeans(3)
ohe = OneHotEncoder()

bar = km.fit_predict(foo)               # 1-D array of cluster labels
ohe.fit_transform(bar.reshape(-1, 1))   # one-hot encode the labels
This returns the expected 100x3 matrix:
<100x3 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
If I stick KMeans in a pipeline:
pipeline = Pipeline([
    ('kmeans', KMeans(3))
])
pipeline.fit_predict(foo)
It returns the non-binarized class labels:
array([1, 2, 2, 0, ... , 1])
However, if I use both KMeans and OneHotEncoder in a pipeline, KMeans feeds the output of its fit_transform() method into OneHotEncoder, and fit_transform() "transforms X to cluster-distance space":
pipeline = Pipeline([
    ('cluster', KMeans(5)),
    ('one_hot', OneHotEncoder())
])
pipeline.fit_transform(foo)
It returns the cluster distances one-hot encoded, as a 100x25 array:
<100x25 sparse matrix of type '<class 'numpy.float64'>'
with 500 stored elements in Compressed Sparse Row format>
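To confirm that this is what's happening, here's a quick check outside the pipeline (using the same foo as above; km5 is just a throwaway name) showing that fit_transform() returns one distance column per cluster center rather than a single label column:
# fit_transform() gives cluster-distance space: one column per cluster center
km5 = KMeans(5)
km5.fit_transform(foo).shape    # (100, 5) -- distances to the 5 centers
km5.fit_predict(foo).shape      # (100,)   -- one label per sample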
I then decided to try creating a sub-pipeline with just KMeans, since my understanding was that pipelines can't have a step with only a fit_predict() method in the middle. This also did not work:
pipeline = Pipeline([
    ('cluster', Pipeline([
        ('kmeans', KMeans(5))
    ])),
    ('one_hot', OneHotEncoder())
])
pipeline.fit_transform(foo)
Returns the same thing:
<100x25 sparse matrix of type '<class 'numpy.float64'>'
with 500 stored elements in Compressed Sparse Row format>
So now I am out of ideas for how to get this kind of program flow to work. Any suggestions?
Edit:
So I found a workaround by creating a new class from KMeans and redefining fit_transform() to return the predicted labels. I also figured out that I should be using LabelBinarizer() instead of OneHotEncoder().
from sklearn.preprocessing import LabelBinarizer

class KMeans_foo(KMeans):
    def fit_transform(self, X, y=None):
        # Return the cluster labels instead of the cluster-distance space
        return self.fit_predict(X)

pipeline = Pipeline([
    ('cluster', KMeans_foo(3)),
    ('binarize', LabelBinarizer())
])
pipeline.fit_transform(foo)
Returns:
array([[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
...,
[0, 1, 0]])
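For reference, outside the pipeline the two encoders produce the same indicator matrix for these labels; the practical difference is that LabelBinarizer accepts the 1-D label vector directly, while OneHotEncoder wants a 2-D column (a quick check using the bar labels from above):
# Both produce a 100x3 indicator matrix for the 3 cluster labels;
# LabelBinarizer takes the 1-D labels directly, OneHotEncoder needs a column vector.
lb_out = LabelBinarizer().fit_transform(bar)
ohe_out = OneHotEncoder().fit_transform(bar.reshape(-1, 1)).toarray()
np.allclose(lb_out, ohe_out)   # True (both order columns by sorted label value)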
Edit2:
Found a "cleaner" method of creating a wrapper for any sklearn model where you want to use the output of a predict method as an intermediate step in a pipeline:
import pandas as pd
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    # Wraps any estimator so that its predict() output becomes the transform() output
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))

pipeline = Pipeline([
    ('cluster', ModelTransformer(KMeans_foo(3))),
    ('binarize', LabelBinarizer())
])
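Since ModelTransformer only relies on fit() and predict(), the KMeans_foo subclass isn't strictly needed here; the wrapper can take a plain KMeans (or any other estimator with a predict method) and can also be used on its own outside a pipeline, e.g.:
# Usage sketch: the wrapper follows the usual fit/transform contract,
# so it also works standalone with an unmodified KMeans.
mt = ModelTransformer(KMeans(3))
labels_df = mt.fit(foo).transform(foo)   # 100x1 DataFrame of cluster labels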