I have three dataframes df1 with 1 160 164 rows and 4 variables,df2 with 11241 rows and 4 variables, and df3 with 1 630 644 rows and 6 variables
df1 looks like :
df2 looks like :
The observations in df1 are those in df3 with energy_kcal_100g_nettoye full.
The observations in df2 are those in df3 with energy_kcal_100g_nettoye no available.
df3 looks like :
I need to find euclidean distance between each rows of df1 and df2 (not within df1 or df2). Then i need to keep the 5 closest index to compute the mean of energy_kcal_100g_nettoye on the 5 index in df3.
I try using this code but it never ends :
for index, row in df2.iterrows():
dist_matrix = df1.apply(lambda row2: [np.linalg.norm(row2.values - row.values)], axis=1)
dist_matrix=dist_matrix.sort_values()
observation=dist_matrix[0:5].index
echantillon=df3.loc[observation]
df3.loc[index,'energy_kcal_100g_mean_distance_5']=echantillon['energy_kcal_100g_nettoye'].mean()
I want to use vectorization to do it faster but i don't succeed. My data is too big.
Can you help me pleased
Thanks
Ps: sorry for my english