
I am trying to build a recommender based on various features of an object (e.g. categories, tags, author, title, views, shares, etc.). As you can see, these features are of mixed types, and I do not have any user-specific data. After displaying the details of one object, I want to display 3 similar objects. I am trying to use kNN with sklearn and found out that one-hot encoding is useful in such cases, but I don't know how to apply it together with kNN. Any help is welcome, even with a totally different library or approach. I'm new to ML.

sns

2 Answers


Check out the Pipeline interface and this good introduction. Pipelines are a clean way of organizing preprocessing with model- and hyper-parameter selection.

My basic setup looks like this:

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    """Transformer that selects a subset of DataFrame columns by name."""

    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; the column selection is fixed
        return self

    def transform(self, X):
        return X[self.names]

numeric = [...]      # fill in your numeric column names
categorical = [...]  # fill in your categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ("numeric", make_pipeline(Columns(names=numeric), StandardScaler())),
        ("categorical", make_pipeline(Columns(names=categorical),
                                      OneHotEncoder(sparse=False))),  # sparse_output=False in newer sklearn
    ])),
    ("model", KNeighborsClassifier()),
])

This lets you easily try out different classifiers and feature transformers (e.g. MinMaxScaler() instead of StandardScaler()), even in a big grid search together with the classifier's hyper-parameters.
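As a minimal sketch (assuming a feature matrix X and labels y, which are placeholders not defined above), such a grid search over the pipeline could look like this:

from sklearn.model_selection import GridSearchCV

# Nested parameters are addressed as <step name>__<parameter>;
# "model" is the step name given in the Pipeline above.
param_grid = {
    "model__n_neighbors": [3, 5, 10],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # X, y are placeholders for your own data
print(search.best_params_)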

Marcus V.

I assume you already have your data cleaned and stored in a pandas.DataFrame or another array-like structure. At that point you would do:

import pandas as pd

# Retrieve and clean your data.
# Store it in a DataFrame called df.

# One-hot encode the categorical columns
df_OHE = pd.get_dummies(df)

# At this stage you will want to rescale your variables to bring them
# into a similar numeric range. This is particularly important for KNN,
# as it uses a distance metric.
from sklearn.preprocessing import StandardScaler
df_OHE_scaled = StandardScaler().fit_transform(df_OHE)

# Now you are all set to use these data to fit a KNN model.

See the pd.get_dummies() doc, and this discussion for an explanation of why scaling is needed for KNN. Note that you can experiment with other types of scalers in sklearn.
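Since the question asks for the 3 most similar objects rather than a class prediction, here is a minimal sketch (reusing df_OHE_scaled from above) with the unsupervised sklearn.neighbors.NearestNeighbors:

from sklearn.neighbors import NearestNeighbors

# Fit on the scaled, one-hot encoded feature matrix
nn = NearestNeighbors(n_neighbors=4)  # the queried object itself + 3 others
nn.fit(df_OHE_scaled)

# Query with the row of the object being displayed (row 0 here);
# the first neighbour is the object itself, so drop it
distances, indices = nn.kneighbors(df_OHE_scaled[[0]])
three_most_similar = indices[0][1:]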

P.S. I assume you are interested in a solution in Python, as you mention those particular packages.

Mischa Lisovyi
  • Thanks a lot for the help. I am having trouble using KNN after transforming the categorical data, as I only know how to use kNN with numerical data, since it uses Euclidean distance. Any video/demo code I can look at? In the meantime I have used Jaccard similarity for each categorical feature with a specific weightage, and I am planning to combine this with other numerical data like view_count, etc. for further use with KNN, although features like title and text body will still be left. Any suggestion for a complete solution reference in my case? – sns May 14 '18 at 22:23
  • "Trouble" as in *the KNN classifier fitting crashes* or *the output KNN classifier is garbage and does not show any similarity*? I'm not aware of a complete tutorial that shows OHE usage in KNN, but I did it in the past for the Titanic competition on kaggle. It is available here: https://github.com/mlisovyi/TitanicSurvivalGuide. I do not claim it to be a perfect tutorial, but it is the only example that I know of :) – Mischa Lisovyi May 15 '18 at 08:03
  • Why wouldn't you use a `sklearn.preprocessing.MultiLabelBinarizer` instead of `get_dummies(df)`? Or would both methods work? Just curious. – sawyermclane Oct 21 '18 at 04:21
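For context on that last comment, a minimal sketch (with made-up data) of the difference: pd.get_dummies encodes columns holding one value per row, whereas MultiLabelBinarizer handles columns whose cells contain collections of labels, such as tag lists:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "category": ["news", "blog", "news"],         # one value per row
    "tags": [["python", "ml"], ["ml"], ["knn"]],  # several values per row
})

# get_dummies handles the single-valued column...
print(pd.get_dummies(df["category"]))

# ...but the multi-label column needs MultiLabelBinarizer
mlb = MultiLabelBinarizer()
tag_matrix = mlb.fit_transform(df["tags"])
print(pd.DataFrame(tag_matrix, columns=mlb.classes_))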