I am trying to get the rows common to two CSV files that have different numbers of rows, using vaex. When I do an inner join I get the error below. As far as I understand, an inner join should not require the two dataframes to have the same row count.
PS C:\Users\ncdex1124> & C:/Users/ncdex1124/AppData/Local/Programs/Python/Python39/python.exe C:\Users\ncdex1124\testwithvaex.py
# CL_SEQ CL_TMID CL_CLIENT_ID CL_ABC CL_CLI_STATUS CL_ABC_STATUS CL_ACTION_TYPE CL_CREATED_DATE CL_CREATED_BY
CL_MODIFIED_DATE CL_MODIFIED_BY
0 3793375 21 MEN0008 ARJPP6330Q A V 0 2019-09-06 14:23:40 SYSTEM
2019-09-09 11:28:03 SYSTEM
1 3793378 8 AACCB1987D AACCB1987D A V 0 2019-09-06 14:23:40 SYSTEM
2019-09-06 14:24:06 SYSTEM
2 3793381 10 AABCI2081G AABCI2081G A V 0 2019-09-06 14:23:40 SYSTEM
2019-09-06 14:24:06 SYSTEM
3 3793383 11 AABCN9894G AABCN9894G A V 0 2019-09-06 14:23:40 SYSTEM
2019-09-06 14:24:06 SYSTEM
4 3793387 12 AAACM0267F AAACM0267F A V 0 2019-09-06 14:23:40 SYSTEM
2019-09-06 14:24:06 SYSTEM
... ... ... ... ... ... ... ... ... ...
... ...
180,011 3793368 185 AACCC0537F AACCC0537F A V 0 2017-06-24 22:21:45 SYSTEM
-- --
180,012 3793369 161 AAACP7015P AAACP7015P A V 0 2017-06-24 22:21:45 SYSTEM
-- --
180,013 3793370 159 AAACA8392E AAACA8392E A V 0 2017-06-24 22:21:45 SYSTEM
-- --
180,014 3793371 167 AACCC5501F AACCC5501F A V 0 2017-06-24 22:21:45 SYSTEM
-- --
180,015 3793372 168 AAHCS7515E AAHCS7515E A V 0 2017-06-24 22:21:45 SYSTEM
-- --
# CL_CLIENT_ID
0 5AFV4
1 5AFV6
2 5AFV8
3 5AFZ1
4 5AGB6
... ...
178,093 AACCC0537F
178,094 AAACP7015P
178,095 AAACA8392E
178,096 AACCC5501F
178,097 AAHCS7515E
3
Traceback (most recent call last):
  File "C:\Users\ncdex1124\testwithvaex.py", line 43, in <module>
    a=CompareCSV(fileName,file5)
  File "C:\Users\ncdex1124\testwithvaex.py", line 32, in CompareCSV
    df_join = vaex_df1.join(vaex_df2,
  File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataframe.py", line 6266, in join
    return vaex.join.join(**kwargs)
  File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\join.py", line 284, in join
    dataset = left.dataset.merged(right_dataset)
  File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataset.py", line 434, in merged
    return DatasetMerged(self, rhs)
  File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataset.py", line 1220, in __init__
    raise ValueError(f'Merging datasets with unequal row counts ({self.left.row_count} != {self.right.row_count})')
ValueError: Merging datasets with unequal row counts (178537 != None)
My code:
import vaex

# Convert each CSV to HDF5 in chunks.
vaex_df1 = vaex.from_csv(file1, convert=True, chunk_size=5_000)
vaex_df2 = vaex.from_csv(file2, convert=True, chunk_size=5_000)
# Reopen the converted HDF5 files.
vaex_df1 = vaex.open(file1 + '.hdf5')
vaex_df2 = vaex.open(file2 + '.hdf5')
print(type(vaex_df1), vaex_df1)
print(type(vaex_df2), vaex_df2)
# Inner join on the shared client ID column; this is the call that raises the ValueError.
df_join = vaex_df1.join(vaex_df2, how='inner', left_on='CL_CLIENT_ID', right_on='CL_CLIENT_ID', allow_duplication=True)
# Every backslash in the Windows path is escaped.
df_join.export_csv('C:\\Users\\abc\\Desktop\\New folder\\file3.csv', chunk_size=10000)
print("success in compare")
How can I avoid this error in my code? I am using vaex instead of pandas because it is faster.
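To make the expectation concrete, here is a minimal sketch of the result I expect from the inner join. It uses only the Python standard library, not vaex, and the helper name `inner_join_rows` is hypothetical. The two tiny inputs are made-up samples with different row counts, reusing a few IDs from the printed output above.

```python
import csv
import io
from collections import defaultdict


def inner_join_rows(left_rows, right_rows, key):
    """Merge each left row with every right row that has the same `key` value."""
    # Index the right side by key; a key may occur more than once.
    right_index = defaultdict(list)
    for row in right_rows:
        right_index[row[key]].append(row)

    joined = []
    for left in left_rows:
        # Left rows without a match on the right are dropped (inner join).
        for right in right_index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical samples: 3 rows on the left, 4 rows on the right.
left_csv = "CL_SEQ,CL_CLIENT_ID\n3793375,MEN0008\n3793368,AACCC0537F\n3793369,AAACP7015P\n"
right_csv = "CL_CLIENT_ID\n5AFV4\nAACCC0537F\nAAACP7015P\nAAACA8392E\n"

left_rows = list(csv.DictReader(io.StringIO(left_csv)))
right_rows = list(csv.DictReader(io.StringIO(right_csv)))

result = inner_join_rows(left_rows, right_rows, "CL_CLIENT_ID")
for row in result:
    print(row)
```

Only the two rows whose `CL_CLIENT_ID` appears in both inputs come back, even though the inputs have 3 and 4 rows. That is the behavior I expected from `how='inner'` in vaex.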