
I have the following 2 dataframes-

dataframe_a
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
+----------------+---------------+

dataframe_b
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|  oldwebsite.fr|
|        samantha|     randomn.fr|
|           dylan|      oldweb.it|
|            ryan|      chicks.it|
+----------------+---------------+

I want to do a full outer join but retain the value from the domain column of dataframe_a in cases where I get 2 different domains for a single user_id. So, my desired dataframe would look like-

desired_df
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
|            ryan|      chicks.it|
+----------------+---------------+

I think I can do something like-

desired_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer").drop(dataframe_b.domain)

But I'm worried about whether this will give me ryan in my desired dataframe or not. Is this the right way to do it?

kev

2 Answers


You will want to use `coalesce`. With your current solution, ryan will be in the resulting dataframe, but with a null value in the retained dataframe_a.domain column.

joined_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
+----------------+---------------+---------------+
|         user_id|         domain|         domain|
+----------------+---------------+---------------+
|            josh|     wanadoo.fr|  oldwebsite.fr|
|        samantha|     randomn.fr|     randomn.fr|
|             bob|    eidsiva.net|           null|
|           dylan|    vodafone.it|      oldweb.it|
|            ryan|           null|      chicks.it|
+----------------+---------------+---------------+

`coalesce` allows you to specify a preference order: it returns the first non-null value among its arguments, skipping nulls.

import pyspark.sql.functions as F

# Prefer dataframe_a's domain; fall back to dataframe_b's where it is null
joined_df = joined_df.withColumn(
  "preferred_domain",
  F.coalesce(dataframe_a.domain, dataframe_b.domain)
)
joined_df = joined_df.drop(dataframe_a.domain).drop(dataframe_b.domain)

Giving

+----------------+----------------+
|         user_id|preferred_domain|
+----------------+----------------+
|            josh|      wanadoo.fr|
|        samantha|      randomn.fr|
|             bob|     eidsiva.net|
|           dylan|     vodafone.it|
|            ryan|       chicks.it|
+----------------+----------------+
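As a quick sanity check of the semantics (plain Python with dicts, not Spark), the full-outer-join-plus-coalesce logic above can be sketched as follows; the dict names and values simply mirror the example dataframes, and this confirms that ryan survives with his dataframe_b domain:

```python
# Plain-Python stand-in for the two dataframes above
a = {"josh": "wanadoo.fr", "samantha": "randomn.fr",
     "bob": "eidsiva.net", "dylan": "vodafone.it"}
b = {"josh": "oldwebsite.fr", "samantha": "randomn.fr",
     "dylan": "oldweb.it", "ryan": "chicks.it"}

# Full outer join on user_id: the key union covers users present in either side.
# `a.get(u) or b.get(u)` mimics coalesce(a.domain, b.domain), since the
# domains here are non-empty strings (falsy only when missing).
preferred = {u: a.get(u) or b.get(u) for u in a.keys() | b.keys()}
```

Note that `or` only stands in for `coalesce` because the values are non-empty strings; in Spark you would always use `F.coalesce` so that genuine empty-string values are not skipped.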
Joey Lesh
    this is a good answer. (1) Is this the only/fastest way to do what I want? (2) Can the `coalesce` step be carried out along with the `join` statement, or will I always need to do the 3 steps you wrote? – kev Nov 21 '19 at 07:18
  • It can be done in one step also, but execution-wise there won't be any difference, I guess. :-) https://stackoverflow.com/questions/53587175/spark-incremental-loading-overwrite-old-record/53590644#53590644 – vikrant rana Nov 21 '19 at 08:24
  • You can't do this as part of the join. You can do it as part of the same line, using chaining. However, they are semantically exactly the same. Spark builds up a stack of operations to perform on the data set and then optimizes and executes them all at once when necessary. Putting the operations on a single, chained line versus sequentially with assignment like I did is no different. Usually that execution happens on a 'show' or 'write' operation. – Joey Lesh Nov 21 '19 at 17:02
  • @JoeyLesh yup, was just asking how the chaining syntax would look like, if there's no other way to do it as part of the `join`. @vikrantrana's link shows it. – kev Nov 21 '19 at 19:09

No, doing a full_outer join alone will leave the desired dataframe with a null value in the domain column for ryan. No type of join operation by itself on the given dataframes will give you the desired output.