
I have two data frames:

df1:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+

df2:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
|    201536|  ROBERT MITZ|01722-3500377|+5215553623333| asceloe@my.com|
|    201537|     MARY ENG|01722-3500127|+5215553623111|generic1@my.com|
|    201538|    RICK BURT|01722-3500983|+5215553623324|generic2@my.com|
|    201539|     JHON DOE|01722-3502547|+5215553621476|generic3@my.com|
+----------+-------------+-------------+--------------+---------------+

And I need to get a third DataFrame with the rows from df1 that do not exist in df2.

like this:

+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
|    879535|  MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+

What is the correct way of doing this?

I've already tried the following:

diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")
diff = df1.subtract(df2)
diff = df1[~ df1['customerId'].isin(df2['customerId'])]

But none of them work. Any suggestions?

TurboAza
  • In general, it will be easier for people to help if you can provide code to generate your dataframes. – ASGM Sep 17 '21 at 21:58
    Your "like this" example is of the ones that do exist in df2; however, you say your "need" is the ones "that does not exist in df2". Please resolve the contradiction, or we cannot help. – Abel Sep 17 '21 at 22:00

2 Answers


You can use merge with indicator=True:

df3 = df1.merge(df2, on=df1.columns.tolist(), how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only'].drop(columns='_merge')

Output:

>>> df3
   customerId     fullName     telephone1      telephone2           email
1      879535  MARIO LOPEZ  01722-3500377  +5215553623333  asceloe@my.com
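
A self-contained sketch (rebuilding the two sample frames from the question) shows the whole approach end to end:

```python
import pandas as pd

# Sample data taken from the question
df1 = pd.DataFrame({
    "customerId": [201534, 879535],
    "fullName": ["MARIO JIMENEZ", "MARIO LOPEZ"],
    "telephone1": ["01722-3500391", "01722-3500377"],
    "telephone2": ["+5215553623333", "+5215553623333"],
    "email": ["ascencio@my.com", "asceloe@my.com"],
})
df2 = pd.DataFrame({
    "customerId": [201534, 201536, 201537, 201538, 201539],
    "fullName": ["MARIO JIMENEZ", "ROBERT MITZ", "MARY ENG",
                 "RICK BURT", "JHON DOE"],
    "telephone1": ["01722-3500391", "01722-3500377", "01722-3500127",
                   "01722-3500983", "01722-3502547"],
    "telephone2": ["+5215553623333", "+5215553623333", "+5215553623111",
                   "+5215553623324", "+5215553621476"],
    "email": ["ascencio@my.com", "asceloe@my.com", "generic1@my.com",
              "generic2@my.com", "generic3@my.com"],
})

# Left merge on every column; indicator=True adds a _merge column
# marking each row as 'both', 'left_only', or 'right_only'
df3 = df1.merge(df2, on=df1.columns.tolist(), how="left", indicator=True)

# Keep only the rows found exclusively in df1, then drop the helper column
df3 = df3[df3["_merge"] == "left_only"].drop(columns="_merge")
```

Only the MARIO LOPEZ row (customerId 879535) survives the filter.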
Corralien

Using pyspark:

You can create a list containing the customerId from DF2 with collect():

from pyspark.sql.functions import col
id_df2 = [id[0] for id in df2.select('customerId').distinct().collect()]

And then filter your DF1 customerId using isin with negation ~:

diff = df1.where(~col('customerId').isin(id_df2))
sophocles
    I only had to import `from pyspark.sql import functions as F` and use it as `F.col()` but this worked thanks – TurboAza Sep 17 '21 at 22:56
  • Your solution is a recipe to kill Spark's driver node. There's a straightforward solution: `df1.join(df2, on='customerId', how='left_anti')` – Kafels Sep 17 '21 at 23:58
  • Thanks for your feedback @Kafels. I tend to use the above command a lot, so could you elaborate on why it's not a good practise? – sophocles Sep 20 '21 at 08:42
  • 1
    @sophocles `collect()` will move all the data from the worker nodes to the driver node; if you have many GB of data, this can cause an OOM exception, which means your entire cluster will go down. The use cases where you really need to collect or move data from the Spark context into Python to apply some specific logic should be rare. Otherwise, if you are working with plotting data, there's no way to avoid it – Kafels Sep 20 '21 at 11:23
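
For comparison, the `left_anti` join Kafels mentions keeps only the df1 rows whose key has no match in df2, and it stays entirely inside Spark. The same key-based semantics in plain pandas (a hedged sketch with cut-down two-column frames, not Spark code) look like this; notably, it is the pandas spelling of the question's third attempt, which fails on Spark DataFrames but works on pandas ones:

```python
import pandas as pd

# Minimal frames; only the customerId key matters for the anti-join
df1 = pd.DataFrame({"customerId": [201534, 879535],
                    "fullName": ["MARIO JIMENEZ", "MARIO LOPEZ"]})
df2 = pd.DataFrame({"customerId": [201534, 201536, 201537, 201538, 201539]})

# Keep df1 rows whose customerId does not appear in df2
# (the pandas equivalent of Spark's how='left_anti' on the key)
diff = df1[~df1["customerId"].isin(df2["customerId"])]
```

Unlike the collect-based approach, an anti-join on the key never materializes the df2 IDs in the driver process.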