In an attempt to make my pandas code faster, I installed modin and tried to use it. A merge of two data frames that had previously worked with plain pandas gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
printing event_df.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
printing other_df info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the modin version cited in the question "Join two modin.pandas.DataFrame(s)", but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df, other_df, on=["id", "other_id"])
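Since the error message and both info printouts mention pandas.core.frame.DataFrame, one thing worth checking is which library each frame actually comes from. A small sketch of such a check follows; the helper name frame_origin and the sample frame are made up for illustration and are not part of my real code:

```python
import pandas


def frame_origin(df):
    """Return the top-level package that defines the class of df."""
    # For a plain pandas frame the class lives in 'pandas.core.frame'.
    return type(df).__module__.split(".")[0]


# Hypothetical stand-in for event_df / other_df.
sample = pandas.DataFrame({"id": ["1"], "other_id": ["A"]})
print(frame_origin(sample))
```

Running the same helper on event_df and other_df, and printing pd.__name__, would show whether pandas frames are being passed to a merge function that belongs to modin.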
It seems there is some problem with modin's merge functionality. Is there a workaround, such as using pandas for the merge and modin only for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem, and if so, is there a solution?
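To make the workaround I have in mind concrete, here is a minimal sketch. It assumes that importing the two libraries under different names (pandas and the alias mpd, which I picked for illustration) avoids rebinding a name that is already in use, and that modin.pandas.DataFrame accepts a plain pandas frame as input. The sample data, the value_sum column, and the fallback to pandas when modin is missing are made up so the snippet runs anywhere:

```python
import pandas  # plain pandas under its full name, never rebound

try:
    # modin under its own alias (assumes modin and an engine are installed)
    import modin.pandas as mpd
except ImportError:
    mpd = pandas  # fallback so the sketch still runs without modin

# Tiny stand-ins for event_df and other_df (made-up values).
event_df = pandas.DataFrame({
    "other_id": ["A", "A", "B"],
    "id": ["1", "2", "3"],
    "category": ["x", "x", "y"],
})
other_df = pandas.DataFrame({
    "id": ["1", "2", "3"],
    "other_id": ["A", "A", "B"],
    "Value": [3, 5, 7],
})

# Step 1: merge with plain pandas, since both inputs are pandas frames.
merged = pandas.merge(event_df, other_df, on=["id", "other_id"])

# Step 2: hand the merged result to modin for the groupby.transform step.
merged_m = mpd.DataFrame(merged)
merged_m["value_sum"] = merged_m.groupby("other_id")["Value"].transform("sum")
print(merged_m)
```

Both imports sit at module level under distinct names, so no name is reassigned after it has already been used. Whether the conversion in step 2 costs more than the speedup it buys on frames of this size is something I have not measured.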