0

I have a small issue comparing two dataframes, which are detailed below. Both are koalas dataframes.

import databricks.koalas as ks

mini_team_df_1 = ks.DataFrame(['0000340b'], columns = ['team_code'])
mini_receipt_df_2 = ks.DataFrame(['0000340b'], columns = ['team_code'])

mini_receipt_df_2['match_flag'] = mini_receipt_df_2['team_code'].isin(ks.DataFrame(mini_team_df_1))

mini_receipt_df_2

I am running this code on Databricks, and I expect mini_receipt_df_2 to look like this:

    team_code   match_flag
0   0000340b     True

But the code shown above actually produces this output:

    team_code   match_flag
0   0000340b     False

This makes no sense to me: I would expect .isin to return True for team_code = 0000340b, since the value is the same in both dataframes.

Could someone help me understand what is wrong?

Thank you

Anna

2 Answers

1

Try this:

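import numpy as np

# Test each receipt team_code against the team codes; both sides are converted with to_numpy()
mini_receipt_df_2['match_flag'] = np.isin(mini_receipt_df_2['team_code'].to_numpy(), mini_team_df_1['team_code'].to_numpy())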

Output:

>>> mini_receipt_df_2
  team_code  match_flag
0  0000340b        True
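
As for why the original call returns False: iterating a DataFrame yields its column labels, not its cell values, so passing a whole DataFrame to `isin` ends up testing each team code against the string 'team_code'. The sketch below reproduces this with plain pandas so that it runs without a Spark cluster. That koalas behaves the same way is an assumption consistent with the output in the question, not something verified here, and the names `teams` and `receipts` are illustrative.

```python
import pandas as pd

teams = pd.DataFrame(["0000340b"], columns=["team_code"])
receipts = pd.DataFrame(["0000340b"], columns=["team_code"])

# Iterating a DataFrame yields column labels, not cell values.
labels = list(teams)

# Passing the whole DataFrame therefore tests membership in the column labels.
flag_from_frame = receipts["team_code"].isin(teams)

# Passing the actual values (here as a plain list) tests what was intended.
team_values = teams["team_code"].tolist()
flag_from_values = receipts["team_code"].isin(team_values)

print(labels)
print(flag_from_frame.tolist())
print(flag_from_values.tolist())
```

In koalas, a comparable call would probably be `.isin(mini_team_df_1['team_code'].to_numpy().tolist())`. This collects the lookup values to the driver, so it is only reasonable when the lookup dataframe is small.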
  • The input dataframes are koalas dataframes, so I am not sure this will work in my case. Can you help me with a solution that works for koalas dataframes? – Anna Feb 09 '22 at 16:08
  • What won't work about it? –  Feb 09 '22 at 16:10
  • I get this error message: ```PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead``` – Anna Feb 09 '22 at 16:11
  • Okay, I see. Check the answer now. I came up with a different solution. –  Feb 09 '22 at 16:12
0
mini_receipt_df_2.merge(mini_team_df_1,how='left',suffixes=[None,'_2'])\
    .assign(match_flag=True)

out:

  team_code  match_flag
0  0000340b        True
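
Note that assigning match_flag=True after a left join marks every row, whether or not it found a match. A possible variant, sketched below, tags the lookup rows before the join and then checks which receipts picked up the tag. It is written in plain pandas so it runs anywhere. The assumption is that `drop_duplicates`, `assign`, `merge`, `notna` and `drop` behave the same way in koalas, which has not been tested here. The helper column `in_teams` and the second team code are made up for illustration.

```python
import pandas as pd

# Illustrative data: one receipt matches a known team, one does not.
receipts = pd.DataFrame({"team_code": ["0000340b", "ffff0000"]})
teams = pd.DataFrame({"team_code": ["0000340b"]})

# Tag every known team before joining; de-duplicate so the join cannot multiply rows.
flagged_teams = teams.drop_duplicates().assign(in_teams=True)

# A left join keeps every receipt; unmatched receipts get a null in `in_teams`.
joined = receipts.merge(flagged_teams, on="team_code", how="left")

# Turn "null vs. True" into a proper boolean flag and drop the helper column.
joined["match_flag"] = joined["in_teams"].notna()
result = joined.drop(columns="in_teams")

print(result)
```

Because the flag comes from a join rather than from collecting values to the driver, this shape should scale better than `isin` when the lookup dataframe is large.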
G.G