
I have two data frames, df and df1. I want to filter out of df1 the records that are present in df, and I was thinking an anti-join can achieve this. But the id column is named differently in the two tables, and I want to join on multiple columns. Is there a neat way to do this?

df1

user_id  name
1        AAA
2        ABC
3        XYZ

df

id  name
1   AAA
2   ABC

Result_dataframe

id  name
3   XYZ

1 Answer


Standardize the column names across the two dataframes, then use the shared column names in your join condition.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 'AAA'), (2, 'ABC')], ['id', 'name'])
df1 = spark.createDataFrame([(1, 'AAA'), (2, 'ABC'), (3, 'XYZ')], ['user_id', 'name'])

#rename the df1 columns to match the df columns
df2 = spark.createDataFrame(df1.rdd, df.columns)

#anti-join on the list of shared column names
df2.join(df, df.columns, 'left_anti').show(100, False)
#+---+----+
#|id |name|
#+---+----+
#|3  |XYZ |
#+---+----+
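If you would rather keep the original column names, you can also pass an explicit join condition instead of renaming; a minimal sketch, assuming the same df and df1 as above:

#anti-join with an explicit column mapping instead of a rename
df1.join(df, (df1['user_id'] == df['id']) & (df1['name'] == df['name']), 'left_anti').show(truncate=False)
#+-------+----+
#|user_id|name|
#+-------+----+
#|3      |XYZ |
#+-------+----+

As a side note, the rename itself can also be done without the RDD round trip via df1.toDF(*df.columns), which keeps everything in the DataFrame API.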