0

I have two dataframes as shown below.

import databricks.koalas as ks

input_data = ks.DataFrame({'code':['123a', '345b', '678c'],
        'id':[1, 2, 3]})

my_data = ks.DataFrame({'code':['123a', '12a', '678c'],
        'id':[7, 8, 9], 'stype':['A', 'E', '5']})

These two dataframes have a column called code and I want to check the values in column code that exist in my_data and also exist in input_data and store them in a resulting dataframe called output. The output dataframe will have only the code column values that are present in the input_data. The number of columns in each dataframe can differ and I have just shown a sample here

The output dataframe will have a result such as follows based on the provided sample in this question.

display(output)

# Result is below
 Code    id 
'123a'   7

I found solutions online that mostly use for loops but I was wondering if there is a more efficient way to approach this.

Thank you all!

Anna
  • 181
  • 1
  • 12
  • my mistake, it is koalas. Sorry for the error. I updated the question – Anna Feb 16 '22 at 20:55
  • I'm confused by what you want in the output. Originally you said that you want the column "code" for values that exist in both "input_data" and "my_data", but right after you say that the "output" dataframe will only have the "code" columns values that are present in the "input_data" dataframe. To clarify, do you want ONLY the rows where the "code" columns match up? – Alex McGraw Feb 16 '22 at 21:20
  • yes that is correct, I only want ONLY the rows where the "code" columns match up – Anna Feb 16 '22 at 21:30

1 Answers1

0

Can try using an inner merge on the two dataframes, and then on the new dataframe, just keeping the one column you want.

For example,

df_new = my_data.merge(input_data, on='code')
df_new = df_new[['code', 'id']]