I am trying to create a recommender based on various features of an object (e.g. categories, tags, author, title, views, shares, etc.). As you can see, these features are of mixed type, and I do not have any user-specific data. After displaying the details of one object, I want to display 3 more similar objects. I am trying to use kNN with sklearn and found out that one-hot encoding is useful in such cases, but I don't know how to apply it together with kNN. Any help is welcome, even with a totally different library or approach. I'm new to ML.
2 Answers
Check out the Pipeline interface and this good introduction. Pipelines are a clean way of organizing preprocessing with model- and hyper-parameter selection.
My basic setup looks like this:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    """Transformer that selects the given columns from a DataFrame."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ("numeric", make_pipeline(Columns(names=numeric), StandardScaler())),
        ("categorical", make_pipeline(Columns(names=categorical), OneHotEncoder(sparse=False))),
    ])),
    ("model", KNeighborsClassifier()),
])
This allows you to easily try out different classifiers and feature transformers (e.g. MinMaxScaler() instead of StandardScaler()), even in a big grid search together with the classifier hyper-parameters, as in the sketch below.
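For example, a minimal grid-search sketch over the pipeline above (the parameter values here are just illustrative assumptions, and df / y stand in for your own feature table and labels):

from sklearn.model_selection import GridSearchCV

# Parameters inside the pipeline are addressed with the
# "<step>__<parameter>" naming convention.
param_grid = {
    "model__n_neighbors": [3, 5, 10],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(df, y)  # df: your feature DataFrame, y: your labels (both assumed)
print(search.best_params_)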

I assume you already have your data cleaned and stored in a pandas.DataFrame or another array-like structure. At that point you would do:
import pandas as pd

# Retrieve and clean your data, then store it in a DataFrame called df.
df_OHE = pd.get_dummies(df)

# At this stage you will want to rescale your variables to bring them
# into a similar numeric range. This is particularly important for KNN,
# as it uses a distance metric.
from sklearn.preprocessing import StandardScaler
df_OHE_scaled = StandardScaler().fit_transform(df_OHE)

# Now you are all set to use these data to fit a KNN classifier.
See the pd.get_dummies() docs, and this discussion for an explanation of why scaling is needed for KNN. Note that you can experiment with other types of scalers in sklearn.
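Since the question asks for the 3 most similar objects rather than a class prediction, a minimal sketch of that last step could use sklearn's unsupervised NearestNeighbors on the scaled matrix (df_OHE_scaled is taken from the snippet above; the row index 0 is just an example query):

from sklearn.neighbors import NearestNeighbors

# n_neighbors=4, because the query row is returned as its own
# closest neighbour and we want 3 *other* items.
nn = NearestNeighbors(n_neighbors=4).fit(df_OHE_scaled)

# Query with the object in row 0 (the double brackets keep it 2-D).
distances, indices = nn.kneighbors(df_OHE_scaled[[0]])
similar = indices[0][1:]  # drop the query item itself
print(similar)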
P.S. I assume you are interested in a solution in Python, as you mention those particular packages.

- Thanks a lot for the help. I am having trouble using KNN after the transformation of the categorical data, as I only know how to use kNN with numerical data, since it uses Euclidean distance. Any video/demo code I can look at? So far I have used Jaccard similarity for each categorical feature with specific weights, and I am planning to combine this with other numerical data like view_count, etc. for further use with KNN, although features like title and text body will still be left out. Any suggestion for a complete solution reference in my case? – sns May 14 '18 at 22:23
- "Trouble" as in *the KNN classifier fitting crashes* or *the output KNN classifier is garbage and does not show any similarity*? I'm not aware of a complete tutorial that would show OHE usage in KNN. But I did it in the past for the Titanic competition on kaggle. It is available here: https://github.com/mlisovyi/TitanicSurvivalGuide. I do not claim it to be a perfect tutorial, but it is the only example that I know of :) – Mischa Lisovyi May 15 '18 at 08:03
- Why wouldn't you use a `sklearn.preprocessing.MultiLabelBinarizer` instead of `get_dummies(df)`? Or would both methods work? Just curious. – sawyermclane Oct 21 '18 at 04:21