
I have two data frames, df and df1. I want to filter out of df1 the records that are present in df, and I was thinking an anti-join can achieve this. But the id column is named differently in the two tables, and I want to join on multiple columns. Is there a neat way to do this?

df1

user_id  name
1        AAA
2        ABC
3        XYZ

df

id  name
1   AAA
2   ABC

Result_dataframe

id  name
3   XYZ

1 Answer


Standardize the column names across the two dataframes, then use the shared column names in your join condition.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'AAA'), (2, 'ABC')], ['id', 'name'])
df1 = spark.createDataFrame([(1, 'AAA'), (2, 'ABC'), (3, 'XYZ')], ['user_id', 'name'])

#rename the df1 columns to match the df columns
df2 = spark.createDataFrame(df1.rdd, df.columns)

#anti-join on the list of shared column names
df2.join(df, df.columns, 'left_anti').show(100, False)
#+---+----+
#|id |name|
#+---+----+
#|3  |XYZ |
#+---+----+
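If you would rather keep the original column names, you can also pass an explicit join condition instead of renaming; a minimal sketch, assuming the same df and df1 as above:

#anti-join with an explicit column mapping instead of a rename
df1.join(df, (df1['user_id'] == df['id']) & (df1['name'] == df['name']), 'left_anti').show(truncate=False)
#+-------+----+
#|user_id|name|
#+-------+----+
#|3      |XYZ |
#+-------+----+

As a side note, the rename itself can also be done without the RDD round trip via df1.toDF(*df.columns), which keeps everything in the DataFrame API.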