I have a dataset of growth rates by time and individual. I'm trying to use KNN to predict growth rates based on historical growth for other individuals.
To start, I transformed my transaction-level dataset so that each row represents an individual and each column represents a day. I want to find the individuals with the closest values.
Here's my sample code:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
neigh = NearestNeighbors(n_neighbors=5, metric = 'euclidean')
df = pd.DataFrame([['A',1,1,.2],['A',1,2,.25],['A',1,4,.3],['B',0,1,.5],['B',0,3,.52],['B',0,2,.51]
,['C',1,1,1.1],['C',1,2,1.3],['C',1,4,1.5]],columns = ['Cust_ID','Gender_Male','Day_No','Value'])
df_unstacked = df.set_index(['Cust_ID','Gender_Male','Day_No']).unstack()
print(df_unstacked)
Day_No                 1     2     3    4
Cust_ID Gender_Male
A       1            0.2  0.25   NaN  0.3
B       0            0.5  0.51  0.52  NaN
C       1            1.1  1.30   NaN  1.5
neigh.fit(df_unstacked) #Throws error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
How should I structure this data so that the missing values don't raise this error? I don't want to impute values; I want distances to be calculated only over the values that exist. Given a sample row, I would like to be able to find the mean value among its nearest neighbors for each day.
I know this is possible, because I've done it before with recommender systems and sparse data, but I'm not familiar with the sklearn KNN syntax and how to get it to skip NaN values when calculating the distance/similarity.
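To make the intended behaviour concrete, here is a rough NumPy-only sketch of what I mean. It is not sklearn, and the function names `masked_distance` and `neighbor_daily_mean` are made up for illustration. It assumes the distance is the root-mean-square difference over the days that both rows have observed (so rows with less overlap are not favoured), and that rows with no shared days are treated as infinitely far away. Both choices are my own assumptions, not something sklearn defines.

```python
import numpy as np

def masked_distance(query, rows):
    """Distance from `query` to each row, using only days observed in both.

    Hypothetical helper: root-mean-square difference over the shared days.
    """
    query = np.asarray(query, dtype=float)
    rows = np.asarray(rows, dtype=float)
    shared = ~np.isnan(rows) & ~np.isnan(query)   # True where both values exist
    diff = np.where(shared, rows - query, 0.0)    # ignore days missing in either
    n_shared = shared.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
    dist[n_shared == 0] = np.inf                  # no overlap: never a neighbor
    return dist

def neighbor_daily_mean(query, rows, k):
    """Indices of the k nearest rows and their per-day mean, skipping NaN."""
    rows = np.asarray(rows, dtype=float)
    nearest = np.argsort(masked_distance(query, rows))[:k]
    return nearest, np.nanmean(rows[nearest], axis=0)

# Same values as df_unstacked above (rows A, B, C; columns are days 1 to 4).
X = np.array([
    [0.2, 0.25, np.nan, 0.3],
    [0.5, 0.51, 0.52, np.nan],
    [1.1, 1.30, np.nan, 1.5],
])

# A sample row with only days 1 and 2 observed.
sample = [0.22, 0.27, np.nan, np.nan]
nearest, daily_mean = neighbor_daily_mean(sample, X, k=2)
print(nearest)      # positions of the nearest rows in X
print(daily_mean)   # mean of the neighbors' values for each day
```

What I'm looking for is the sklearn equivalent of this, or a way to plug a distance like this into `NearestNeighbors`.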