0

I have two large datasets Df1 & Df2 with numeric variables. I have to match each numeric variable in Df2 with its corresponding one in Df1. The column names are completly different in both datasets.

I have already tried those two methods, with cn a column of Df_2:

def float_compare_desc(Df1, Df2, cn, match_f):

for cr in Df1.columns:
    if str(Df1[cr].dtype) == 'float64':
        desc_n = Df2[cn].describe(percentiles=list(np.linspace(0.1,0.9,9)))[1:]
        desc_r = Df1[cr].describe(percentiles=list(np.linspace(0.1,0.9,9)))[1:]                  
        match_f[cn,cr] = mean_squared_error(desc_n , desc_r)
    else:
        continue
return match_f

and this one with kolmogorov smirnov:

def float_compare_ks(Df1, Df2, cn, match_f):

for cr in Df1.columns:
    if str(Df1[cr].dtype) == 'float64':
        match_f[cn,cr] = stats.ks_2samp(cn,cr)
    else:
        continue
return match_f

Unfortunately this two methods are not giving decent results. Does someone know other possible methods?

Thanks in advance

yassine
  • 55
  • 8

1 Answers1

0

if you've created those datasets from an excel or csv, the numeric values may not work as excepted you need to cast them into python integers/floats.

Coming to comparisions, if you only want to compare one column from dataset 1 and another from dataset2, you can do something like this:

col1 = df1["youcol"].tolist()
col2 = df2["youcol"].tolist()
#assuming length of both the lists are same, you can loop it
for i in range(len(col1)):
    if float(col1[i]) == float(col2[i]): #you can use int if you like to
        #do something
    else:
        #do something else
Venkatesh Dharavath
  • 500
  • 1
  • 5
  • 18
  • Actually, this is not the problem. I already changed all my variables to float. and the matching is going well. My question is really abouth the method to use in order to have the similarity ratio between two numeric columns. – yassine Dec 05 '20 at 12:05
  • @yassine, If you're searching for an inbuilt function I'm not sure, but if you could post what exactly you want by comparing two columns, I may write a function for you. Add the output you want. – Venkatesh Dharavath Dec 06 '20 at 10:22