I have two large datasets, Df1 and Df2, with numeric variables. I need to match each numeric variable in Df2 with its corresponding variable in Df1. The column names are completely different in the two datasets.
I have already tried the following two methods, where cn is the name of a column in Df2:
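    import numpy as np
    from sklearn.metrics import mean_squared_error

    def float_compare_desc(Df1, Df2, cn, match_f):
        # Compare summary statistics (mean, std, min, deciles, max) of Df2[cn]
        # with those of every float column of Df1
        pcts = list(np.linspace(0.1, 0.9, 9))
        desc_n = Df2[cn].describe(percentiles=pcts)[1:]  # [1:] drops 'count'
        for cr in Df1.columns:
            if str(Df1[cr].dtype) == 'float64':
                desc_r = Df1[cr].describe(percentiles=pcts)[1:]
                match_f[cn, cr] = mean_squared_error(desc_n, desc_r)
        return match_f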
And this one, using the Kolmogorov-Smirnov test:
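    from scipy import stats

    def float_compare_ks(Df1, Df2, cn, match_f):
        # Two-sample Kolmogorov-Smirnov test of Df2[cn] against every float column of Df1
        for cr in Df1.columns:
            if str(Df1[cr].dtype) == 'float64':
                # Pass the column data, not the column names
                match_f[cn, cr] = stats.ks_2samp(Df2[cn].dropna(), Df1[cr].dropna())
        return match_f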
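To make clear how the scores are meant to be used, here is a minimal, self-contained sketch of the matching step based on the KS statistic. The helper name `best_matches` and the toy data are hypothetical and only for illustration; picking the column with the smallest statistic is an assumption about how the best match would be chosen.

```python
import numpy as np
import pandas as pd
from scipy import stats


def best_matches(df1, df2):
    """For each float column of df2, return the float column of df1
    with the smallest two-sample KS statistic (hypothetical helper)."""
    cols1 = df1.select_dtypes(include="float").columns
    cols2 = df2.select_dtypes(include="float").columns
    matches = {}
    for cn in cols2:
        # KS statistic of df2[cn] against every candidate column of df1
        scores = {
            cr: stats.ks_2samp(df2[cn].dropna(), df1[cr].dropna()).statistic
            for cr in cols1
        }
        # Smaller statistic means the two empirical distributions are closer
        matches[cn] = min(scores, key=scores.get)
    return matches


# Toy data (made up): two clearly different distributions under different names
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"a": rng.normal(0, 1, 500), "b": rng.normal(100, 5, 500)})
df2 = pd.DataFrame({"x": rng.normal(100, 5, 500), "y": rng.normal(0, 1, 500)})
result = best_matches(df1, df2)
print(result)
```

On toy data like this the matching is easy because the distributions barely overlap; the real datasets are much less clear-cut.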
Unfortunately, neither of these two methods gives decent results. Does anyone know of other possible methods?
Thanks in advance