I am looking to use scikit-learn's KMeans to cluster a dataset into K bins and then use OneHotEncoder to binarize the resulting cluster labels. I'd like to use this in a Pipeline, but I think I'm running into problems because KMeans returns the labels via its fit_predict() method, not via fit_transform().
Here is some sample code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

foo = np.random.randn(100, 50)
km = KMeans(3)
ohe = OneHotEncoder()

bar = km.fit_predict(foo)               # 1-D array of cluster labels
ohe.fit_transform(bar.reshape(-1, 1))   # one-hot encode the labels
This returns the expected 100x3 matrix:
<100x3 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
If I stick KMeans in a pipeline:
pipeline = Pipeline([
    ('kmeans', KMeans(3))
])
pipeline.fit_predict(foo)
It returns the non-binarized class labels:
array([1, 2, 2, 0, ... , 1])
However, if I use both KMeans and OneHotEncoder in a pipeline, KMeans feeds the output of its fit_transform() method into OneHotEncoder, and fit_transform() "transforms X to cluster-distance space":
pipeline = Pipeline([
    ('cluster', KMeans(5)),
    ('one_hot', OneHotEncoder())
])
pipeline.fit_transform(foo)
It returns the cluster distances one-hot encoded, as a 100x25 array:
<100x25 sparse matrix of type '<class 'numpy.float64'>'
with 500 stored elements in Compressed Sparse Row format>
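To confirm that this is what's happening, here's a quick check outside the pipeline (using the same foo as above; km5 is just a throwaway name) showing that fit_transform() returns one distance column per cluster center rather than a single label column:
# fit_transform() gives cluster-distance space: one column per cluster center
km5 = KMeans(5)
km5.fit_transform(foo).shape    # (100, 5) -- distances to the 5 centers
km5.fit_predict(foo).shape      # (100,)   -- one label per sample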
I then decided to try creating a sub-pipeline with just KMeans, since my understanding was that pipelines can't have a step with only a fit_predict() method in the middle. This also did not work:
pipeline = Pipeline([
    ('cluster', Pipeline([
        ('kmeans', KMeans(5))
    ])),
    ('one_hot', OneHotEncoder())
])
pipeline.fit_transform(foo)
Returns the same thing:
<100x25 sparse matrix of type '<class 'numpy.float64'>'
with 500 stored elements in Compressed Sparse Row format>
So now I am out of ideas for how to get this kind of program flow to work. Any suggestions?
Edit:
So I found a workaround by creating a new class from KMeans and redefining fit_transform() to return the predicted labels. I also figured out that I should be using LabelBinarizer() instead of OneHotEncoder().
from sklearn.preprocessing import LabelBinarizer

class KMeans_foo(KMeans):
    def fit_transform(self, X, y=None):
        # Return the cluster labels instead of the cluster-distance space
        return self.fit_predict(X)

pipeline = Pipeline([
    ('cluster', KMeans_foo(3)),
    ('binarize', LabelBinarizer())
])
pipeline.fit_transform(foo)
Returns:
array([[0, 0, 1],
[1, 0, 0],
[0, 0, 1],
...,
[0, 1, 0]])
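For reference, outside the pipeline the two encoders produce the same indicator matrix for these labels; the practical difference is that LabelBinarizer accepts the 1-D label vector directly, while OneHotEncoder wants a 2-D column (a quick check using the bar labels from above):
# Both produce a 100x3 indicator matrix for the 3 cluster labels;
# LabelBinarizer takes the 1-D labels directly, OneHotEncoder needs a column vector.
lb_out = LabelBinarizer().fit_transform(bar)
ohe_out = OneHotEncoder().fit_transform(bar.reshape(-1, 1)).toarray()
np.allclose(lb_out, ohe_out)   # True (both order columns by sorted label value)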
Edit2:
Found a "cleaner" method of creating a wrapper for any sklearn model where you want to use the output of a predict method as an intermediate step in a pipeline:
import pandas as pd
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    # Wraps any estimator so that its predict() output becomes the transform() output
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict(X))

pipeline = Pipeline([
    ('cluster', ModelTransformer(KMeans_foo(3))),
    ('binarize', LabelBinarizer())
])
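Since ModelTransformer only relies on fit() and predict(), the KMeans_foo subclass isn't strictly needed here; the wrapper can take a plain KMeans (or any other estimator with a predict method) and can also be used on its own outside a pipeline, e.g.:
# Usage sketch: the wrapper follows the usual fit/transform contract,
# so it also works standalone with an unmodified KMeans.
mt = ModelTransformer(KMeans(3))
labels_df = mt.fit(foo).transform(foo)   # 100x1 DataFrame of cluster labels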