I need to compare two CSV files and do an inner join. I am using vaex, which is faster than pandas, but I got stuck. My code worked with pandas, but it was slow. How can I inner join two HDF5 files and write the output to CSV?

My code

    vaex_df1 = vaex.from_csv(file1,convert=True, chunk_size=5_000)
    vaex_df2 = vaex.from_csv(file2,convert=True, chunk_size=5_000)
    vaex_df1 = vaex.open(file1+'.hdf5')
    vaex_df2 = vaex.open(file2+'.hdf5')
    print(type(vaex_df1),vaex_df1)
    print(type(vaex_df2),vaex_df2)
    df_join = pd.merge(vaex_df1,vaex_df2,how='inner',left_on ='CL_CLIENT_ID',right_on='CL_CLIENT_ID')
    df_join.to_csv('C:\\Users\\abc\\Desktop\\New folder\\file3.csv')
    print("success in compare")

Is there a way to do an inner join in vaex, the way merge works in pandas? I couldn't find much about it online. The code fails at the 'df_join = pd.merge' line, which is expected, since those are vaex DataFrames rather than pandas ones.

alok sharma

1 Answer

The vaex tutorial has a section on joining: https://vaex.io/docs/tutorial.html#Joining. The API closely resembles that of pandas. Try:

    df_join = vaex_df1.join(vaex_df2,
                            how='inner',
                            left_on='CL_CLIENT_ID',
                            right_on='CL_CLIENT_ID')

Peter Leimbigler
  • I tried this but got an error saying the row counts of the DataFrames don't match. In my case the DataFrames do have different row counts. How would I tackle that? – alok sharma Aug 24 '21 at 08:29
  • That's odd, I would expect joins to work perfectly fine between DataFrames of different length! Could you post a minimal runnable example of data from the two DataFrames, and the resulting error message? – Peter Leimbigler Aug 24 '21 at 14:19
  • Please see https://stackoverflow.com/questions/68897133/valueerror-merging-datasets-with-unequal-row-counts-vaex-join-python – alok sharma Aug 24 '21 at 18:00
  • Apparently it worked with vaex_df2.join. I wonder why? – alok sharma Aug 24 '21 at 18:16
  • As @PeterLeimbigler said, can we have a reproducible / runnable example? We can't use the example you posted in the link above, as we don't have the data. – Joco Oct 16 '21 at 01:34
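
On the point raised in the comments about differing row counts: an inner join does not require both tables to have the same number of rows. The sketch below shows this with a plain-Python dictionary lookup over two small CSVs. It is not vaex and says nothing about how vaex implements joins. The helper name `inner_join_csv` and the sample data are made up for illustration; only the `CL_CLIENT_ID` column name comes from the question.

```python
import csv
import io


def inner_join_csv(left_file, right_file, key):
    """Inner-join two CSV streams on `key` using a dictionary lookup."""
    # Index the right-hand rows by key; a key may appear more than once.
    right_index = {}
    for row in csv.DictReader(right_file):
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left_row in csv.DictReader(left_file):
        # Left rows whose key is missing on the right are dropped (inner join).
        for right_row in right_index.get(left_row[key], []):
            merged = dict(left_row)
            # Add the right-hand columns, skipping the shared key column.
            # Assumption: no other column names overlap between the files.
            merged.update({k: v for k, v in right_row.items() if k != key})
            joined.append(merged)
    return joined


# Hypothetical sample data: 4 rows on the left, 2 rows on the right.
left_csv = io.StringIO("CL_CLIENT_ID,name\n1,Ann\n2,Bob\n3,Cy\n4,Dee\n")
right_csv = io.StringIO("CL_CLIENT_ID,balance\n2,10\n4,25\n")

rows = inner_join_csv(left_csv, right_csv, "CL_CLIENT_ID")

# Write the joined rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["CL_CLIENT_ID", "name", "balance"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Only the client IDs present in both inputs survive, so the result here has two rows even though the inputs have four and two. If a vaex join complains about row counts, the cause is therefore more likely something specific to the data or the call than the length mismatch itself.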