I am trying to get the common rows from two CSV files that have different numbers of rows, using vaex. When I do an inner join, I get the error below. An inner join should not require the two dataframes to have the same row count.

        PS C:\Users\ncdex1124> & C:/Users/ncdex1124/AppData/Local/Programs/Python/Python39/python.exe C:\Users\ncdex1124\testwithvaex.py
    #        CL_SEQ    CL_TMID    CL_CLIENT_ID    CL_ABC      CL_CLI_STATUS    CL_ABC_STATUS    CL_ACTION_TYPE    CL_CREATED_DATE      CL_CREATED_BY  
      CL_MODIFIED_DATE     CL_MODIFIED_BY
    0        3793375   21         MEN0008         ARJPP6330Q  A                V                0                 2019-09-06 14:23:40  SYSTEM
      2019-09-09 11:28:03  SYSTEM
    1        3793378   8          AACCB1987D      AACCB1987D  A                V                0                 2019-09-06 14:23:40  SYSTEM
      2019-09-06 14:24:06  SYSTEM
    2        3793381   10         AABCI2081G      AABCI2081G  A                V                0                 2019-09-06 14:23:40  SYSTEM
      2019-09-06 14:24:06  SYSTEM
    3        3793383   11         AABCN9894G      AABCN9894G  A                V                0                 2019-09-06 14:23:40  SYSTEM
      2019-09-06 14:24:06  SYSTEM
    4        3793387   12         AAACM0267F      AAACM0267F  A                V                0                 2019-09-06 14:23:40  SYSTEM
      2019-09-06 14:24:06  SYSTEM
    ...      ...       ...        ...             ...         ...              ...              ...               ...                  ...
      ...                  ...
    180,011  3793368   185        AACCC0537F      AACCC0537F  A                V                0                 2017-06-24 22:21:45  SYSTEM
      --                   --
    180,012  3793369   161        AAACP7015P      AAACP7015P  A                V                0                 2017-06-24 22:21:45  SYSTEM
      --                   --
    180,013  3793370   159        AAACA8392E      AAACA8392E  A                V                0                 2017-06-24 22:21:45  SYSTEM
      --                   --
    180,014  3793371   167        AACCC5501F      AACCC5501F  A                V                0                 2017-06-24 22:21:45  SYSTEM
      --                   --
    180,015  3793372   168        AAHCS7515E      AAHCS7515E  A                V                0                 2017-06-24 22:21:45  SYSTEM
      --                   --
    #        CL_CLIENT_ID
    0        5AFV4
    1        5AFV6
    2        5AFV8
    3        5AFZ1
    4        5AGB6
    ...      ...
    178,093  AACCC0537F
    178,094  AAACP7015P
    178,095  AAACA8392E
    178,096  AACCC5501F
    178,097  AAHCS7515E
    3
    Traceback (most recent call last):
      File "C:\Users\ncdex1124\testwithvaex.py", line 43, in <module>
        a=CompareCSV(fileName,file5)
      File "C:\Users\ncdex1124\testwithvaex.py", line 32, in CompareCSV
        df_join = vaex_df1.join(vaex_df2,
      File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataframe.py", line 6266, in join
        return vaex.join.join(**kwargs)
      File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\join.py", line 284, in join
        dataset = left.dataset.merged(right_dataset)
      File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataset.py", line 434, in merged
        return DatasetMerged(self, rhs)
      File "C:\Users\ncdex1124\AppData\Local\Programs\Python\Python39\lib\site-packages\vaex\dataset.py", line 1220, in __init__
        raise ValueError(f'Merging datasets with unequal row counts ({self.left.row_count} != {self.right.row_count})')
    ValueError: Merging datasets with unequal row counts (178537 != None)

My code:

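    import vaex

    # Convert each CSV to HDF5 in chunks, then open the converted files
    vaex_df1 = vaex.from_csv(file1, convert=True, chunk_size=5_000)
    vaex_df2 = vaex.from_csv(file2, convert=True, chunk_size=5_000)
    vaex_df1 = vaex.open(file1 + '.hdf5')
    vaex_df2 = vaex.open(file2 + '.hdf5')
    print(type(vaex_df1), vaex_df1)
    print(type(vaex_df2), vaex_df2)
    # Inner join on the shared key column; this is the call that raises the ValueError
    df_join = vaex_df1.join(vaex_df2, how='inner', left_on='CL_CLIENT_ID', right_on='CL_CLIENT_ID', allow_duplication=True)
    # Raw string so the backslashes in the Windows path are taken literally
    df_join.export_csv(r'C:\Users\abc\Desktop\New folder\file3.csv', chunk_size=10000)
    print("success in compare")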

How can I avoid this error in my code? I am using vaex instead of pandas because it is faster.
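To make the expected behaviour concrete, here is a minimal plain-Python sketch of the inner join I have in mind. It is not vaex code and says nothing about how vaex implements joins; the helper name `inner_join` is hypothetical, and the sample rows are a few values copied from the output above. It only shows that the result depends on matching keys, not on both inputs having the same number of rows.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, key):
    """Hypothetical helper: inner join of two lists of dicts on `key`."""
    # Index the right side by key; duplicate keys are kept, so a left row
    # can match several right rows.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined = []
    for row in left_rows:
        # Left rows without a match on the right are dropped (inner join).
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Three rows on the left, two on the right: unequal row counts.
left = [
    {"CL_CLIENT_ID": "MEN0008", "CL_TMID": 21},
    {"CL_CLIENT_ID": "AACCB1987D", "CL_TMID": 8},
    {"CL_CLIENT_ID": "AABCI2081G", "CL_TMID": 10},
]
right = [
    {"CL_CLIENT_ID": "5AFV4"},
    {"CL_CLIENT_ID": "AACCB1987D"},
]

common = inner_join(left, right, "CL_CLIENT_ID")
print(common)
```

Only the row whose `CL_CLIENT_ID` appears in both inputs ends up in `common`, which is the result I expect from the vaex join as well.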

alok sharma