1

I am trying to join two the dataframes as shown below on the code column values present in the name_data dataframe.

I have two dataframes shown below and I expect to have a resulting dataframe which would only have the rows from the `team_datadataframe where the correspondingcodevalue column is present in thename_data``` dataframe.

I am using koalas for this on databricks and I have the following code using the join operation.

import databricks.koalas as ks

name_data= ks.DataFrame({'code':['123a', '345b', '678c'],
        'id':[1, 2, 3]})

team_data = ks.DataFrame({'code':['123a', '23s', '34a'],
        'id':[1, 2, 3]})

team_data_filtered = team_data.join(name_data.set_index('code'), on='code')

display(team_data_filtered)

The expected output would be to see only the following in team_data_filtered.

Code    id
'123a'   1

But my code is throwing an error stating that columns overlap but no suffix specified: ['id'].

May someone help to resolve this issue?

Anna
  • 181
  • 1
  • 12

2 Answers2

1

Try adding suffix parameters:

team_data_filtered = team_data.join(name_data.set_index('code'), on='code', 
                                                lsuffix='_1', rsuffix='_2')
team_data_filtered = team_data_filtered.loc[team_data_filtered.id_1==team_data_filtered.id2]
display(team_data_filtered)   

An then to clean the columns, if desired:

team_data_filtered.rename({'id_1':'id'}, inplace=True, axis=1)

                                                                     
AlecZ
  • 546
  • 5
  • 9
  • 1
    what does the lsuffix and rsuffix do? and what does the _1 and _2 mean? – Anna Feb 15 '22 at 18:12
  • For any join where the non "on" columns have the same name, a suffix is required (l or r, for left or right) to append a string suffix to its name. I updated to clean up the dataframe after processing the join – AlecZ Feb 15 '22 at 18:19
  • I tried to do sth as the following: ```team_data_filtered = name_data.join(team_data.set_index('code'), on='code', lsuffix='_1', rsuffix='_2').dropna().drop(['*_1','*_2'])```. I try to drop all the columns that start with *_1 and *_2 as detailed in the suffix, I am not sure what this does not work as it says invalid syntax in drop(). May you help me resolve it? – Anna Feb 15 '22 at 18:24
  • 1
    You'd probably only want to drop one, and rename another? In any event the syntax wouldn't be '*_2' but 'id_2', for example, or ['id_1', 'id_2']. If you had more columns you could do : drop([i + '_2' for i in dataframe.columns]) – AlecZ Feb 15 '22 at 18:31
  • how do i rename all the columns with suffix _1 using the code rather than hardcoding the values as I can have many columns in the dataframe? – Anna Feb 15 '22 at 18:35
  • 1
    You can chain them, such as: team_data_filtered.drop([i for i in team_data_filtered.columns if i.endswith('_1')]) – AlecZ Feb 15 '22 at 18:59
0
name_data= ps.DataFrame({'code':['123a', '345b', '678c'],
                         'id':[1, 2, 3]})

team_data = ps.DataFrame({'code':['123a', '23s', '34a'],
                          'id':[1, 2, 3]})

team_data_filtered = team_data.join(name_data.set_index('code'),how='inner', on='code',rsuffix="_2")

team_data_filtered.drop("id_2",axis=1).to_spark().show()

out:

+----+---+
|code| id|
+----+---+
|123a|  1|
+----+---+
G.G
  • 639
  • 1
  • 5