8

I have 3 dataframes generated from 3 different processes. Every dataframe has columns of the same name. My dataframes look like this:

id   val1    val2       val3    val4
 1    null   null       null    null
 2    A2      A21       A31      A41

id   val1      val2       val3      val4
 1    B1        B21        B31       B41
 2    null      null       null      null

id   val1     val2       val3    val4
 1    C1       C2        C3       C4
 2    C11      C12       C13      C14

Out of these 3 dataframes, I want to create two dataframes (final and consolidated). For final, the order of preference is: dataframe 1 > dataframe 2 > dataframe 3.

If a result is present in dataframe 1 (val1 != null), I will store that row in the final dataframe; otherwise I fall back to dataframe 2, then dataframe 3.

My final result should be:

id  finalVal1    finalVal2   finalVal3   finalVal4 
1     B1           B21         B31         B41
2     A2           A21         A31         A41

The consolidated dataframe will store the results from all 3.

How can I do that efficiently?

Neha
  • Could you please provide the expected output using the dataframes described above? – cheseaux Oct 11 '16 at 11:52
  • Added the final output required – Neha Oct 11 '16 at 11:54
  • Is it correct that you either have no `null` or a full line? Is the id in the dataset when the row is null? – Wilmerton Oct 11 '16 at 12:08
  • Yeah.. I can either have null or a full line.. The id would be in the dataset when the whole row is null – Neha Oct 11 '16 at 12:25
  • Sorry, it's not clear to me (you say yes, but your statement contradicts mine). Do lines like `1 B1 null B31 B41` exist? And in this case, do you reject the whole line or do you accept non-null values? – Wilmerton Oct 11 '16 at 13:04
  • https://stackoverflow.com/questions/31240148/spark-specify-multiple-column-conditions-for-dataframe-join this question has a generalised answer. – Nitya Yekkirala Sep 12 '18 at 06:34

4 Answers

13

If I understood you correctly, for each row you want to find the first non-null value per column, looking first into the first table, then the second table, then the third table.

You simply need to join the three tables on `id` and then use the `coalesce` function to get the first non-null element:

import org.apache.spark.sql.functions._

val df1 = sc.parallelize(Seq(
    (1,null,null,null,null),
    (2,"A2","A21","A31", "A41"))
  ).toDF("id", "val1", "val2", "val3", "val4")

val df2 = sc.parallelize(Seq(
    (1,"B1","B21","B31", "B41"),
    (2,null,null,null,null))
  ).toDF("id", "val1", "val2", "val3", "val4")

val df3 = sc.parallelize(Seq(
    (1,"C1","C2","C3","C4"),
    (2,"C11","C12","C13", "C14"))
  ).toDF("id", "val1", "val2", "val3", "val4")

val consolidated = df1.join(df2, "id").join(df3, "id").select(
  df1("id"),
  coalesce(df1("val1"), df2("val1"), df3("val1")).as("finalVal1"),
  coalesce(df1("val2"), df2("val2"), df3("val2")).as("finalVal2"),
  coalesce(df1("val3"), df2("val3"), df3("val3")).as("finalVal3"),
  coalesce(df1("val4"), df2("val4"), df3("val4")).as("finalVal4")
)

Which gives you the expected output:

+---+---------+---------+---------+---------+
| id|finalVal1|finalVal2|finalVal3|finalVal4|
+---+---------+---------+---------+---------+
|  1|       B1|      B21|      B31|      B41|
|  2|       A2|      A21|      A31|      A41|
+---+---------+---------+---------+---------+
cheseaux
  • Will it work if the values in all the tables are null? I want to get null if the value in all 3 is null. – Neha Oct 11 '16 at 14:21
  • Yes, `coalesce` returns the first non-null value, or null if everything is null – cheseaux Oct 11 '16 at 14:22
  • ```consolidated = df1.join(df2, "id").join(df3, "id")``` I have seen this solution and it seems very good for multiple joins. Is it possible to make it work for N tables? Because in my case I have a list of datasets which I need to join, but I do not know how many datasets there will be in the list (it comes from the user). Is it possible to make it more generic? – Toby Djelyinski Nov 25 '19 at 10:55
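Regarding the last comment, here is a minimal sketch of a generic version for N dataframes (not part of the original answer). It assumes every dataframe shares the same schema (a common "id" key plus value columns) and that the sequence is ordered by preference; the helper name consolidate is hypothetical.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: generalises the join + coalesce pattern to any
// number of dataframes. Assumes identical schemas and that dfs is
// ordered from highest to lowest preference.
def consolidate(dfs: Seq[DataFrame], key: String = "id"): DataFrame = {
  val joined = dfs.reduce((left, right) => left.join(right, key))
  val valueCols = dfs.head.columns.filterNot(_ == key)
  val selected = col(key) +: valueCols.map { c =>
    // for each column, take the first non-null value across the inputs
    coalesce(dfs.map(df => df(c)): _*).as("final" + c.capitalize)
  }
  joined.select(selected: _*)
}

// Usage: consolidate(Seq(df1, df2, df3))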
0

Edit: New solution that handles partially null lines. It avoids joins, but uses a window function and a distinct...

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

case class a(id:Int,val1:String,val2:String,val3:String,val4:String)

val df1 = sc.parallelize(List(
a(1,null,null,null,null),
a(2,"A2","A21","A31","A41"),
a(3,null,null,null,null))).toDF()

val df2 = sc.parallelize(List(
a(1,"B1",null,"B31","B41"),
a(2,null,null,null,null),
a(3,null,null,null,null))).toDF()

val df3 = sc.parallelize(List(
a(1,"C1","C2","C3","C4"),
a(2,"C11","C12","C13","C14"),
a(3,"C11","C12","C13","C14"))).toDF()

// true if at least one value column is non-null
val anyNotNull = df1.columns.tail.map(c => col(c).isNotNull).reduce(_ || _)

// union the three inputs, tagging each row with its priority in column "foo"
val consolidated = {
  df1
    .filter(anyNotNull)
    .withColumn("foo", lit(1))
    .unionAll(df2.filter(anyNotNull).withColumn("foo", lit(2)))
    .unionAll(df3.filter(anyNotNull).withColumn("foo", lit(3)))
}

val w = Window.partitionBy('id).orderBy('foo)

// per id, take the first non-null value of each column in priority order
val coalesced = col("id") +: df1.columns.tail.map(c => first(col(c), true).over(w).as(c))

scala> consolidated.select(coalesced:_*).show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
|  1|  B1|null| B31| B41|
|  1|  B1|  C2| B31| B41|
|  3| C11| C12| C13| C14|
|  2|  A2| A21| A31| A41|
|  2|  A2| A21| A31| A41|
+---+----+----+----+----+

The running window still leaves one partially coalesced row per input, so na.drop and distinct reduce them to a single row per id:

val finalDF = consolidated.select(coalesced:_*).na.drop.distinct

scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
|  1|  B1|  C2| B31| B41|
|  3| C11| C12| C13| C14|
|  2|  A2| A21| A31| A41|
+---+----+----+----+----+

Old solution:

If you have only fully null lines or no nulls at all, you can do this (edit: the advantage over the other solution is that you avoid the distinct):

data:

case class a(id:Int,val1:String,val2:String,val3:String,val4:String)

val df1 = sc.parallelize(List(
a(1,null,null,null,null),
a(2,"A2","A21","A31","A41"),
a(3,null,null,null,null))).toDF()
val df2 = sc.parallelize(List(
a(1,"B1","B21","B31","B41"),
a(2,null,null,null,null),
a(3,null,null,null,null))).toDF()
val df3 = sc.parallelize(List(
a(1,"C1","C2","C3","C4"),
a(2,"C11","C12","C13","C14"),
a(3,"C11","C12","C13","C14"))).toDF()

consolidated:

val consolidated = {
  df1.na.drop.withColumn("foo",lit(1))
  .unionAll(df2.na.drop.withColumn("foo",lit(2)))
  .unionAll(df3.na.drop.withColumn("foo",lit(3)))
}

scala> consolidated.show()
+---+----+----+----+----+---+
| id|val1|val2|val3|val4|foo|
+---+----+----+----+----+---+
|  2|  A2| A21| A31| A41|  1|
|  1|  B1| B21| B31| B41|  2|
|  1|  C1|  C2|  C3|  C4|  3|
|  2| C11| C12| C13| C14|  3|
|  3| C11| C12| C13| C14|  3|
+---+----+----+----+----+---+

final:

val w = Window.partitionBy('id).orderBy('foo)
val finalDF = consolidated
  .withColumn("foo2",rank().over(w))
  .filter('foo2===1)
  .drop("foo").drop("foo2")

scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
|  1|  B1| B21| B31| B41|
|  3| C11| C12| C13| C14|
|  2|  A2| A21| A31| A41|
+---+----+----+----+----+
Wilmerton
  • Your solution is way too complicated and OP doesn't have a third row in his example, it is confusing – cheseaux Oct 11 '16 at 13:27
  • I don't see the problem with improving the coverage of the test example. OP asks for a performant solution, not a simple one. – Wilmerton Oct 11 '16 at 13:32
  • It is confusing. You think that using a window function, distinct, filter, drop, map/reduce, a case class, etc. is more efficient than a join with coalesce? – cheseaux Oct 11 '16 at 13:35
  • filter and na.drop don't require a shuffle, and so are much faster than joins. The map-reduce is executed on a very small collection (the columns) and is also very fast. Piping joins creates frightening execution plans. I don't understand why you argue subjectively on stackoverflow ("it is confusing"). – Wilmerton Oct 11 '16 at 13:38
0

Below is an example of joining six tables/dataframes (not using SQL).

retail_db is a well-known sample database; anyone can get it from Google.

Problem: get all customers from TX who bought fitness items.

  // Helper to read one retail_db table over JDBC (same options for every table)
  def readTable(table: String) = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost/retail_db?useSSL=false")
    .option("driver", "com.mysql.jdbc.Driver").option("dbtable", table)
    .option("user", "root").option("password", "root")
    .load()

  val df_customers       = readTable("customers")
  val df_products        = readTable("products")
  val df_orders          = readTable("orders")
  val df_order_items     = readTable("order_items")
  val df_categories      = readTable("categories")
  val df_departments     = readTable("departments")
  val df_order_items_all = readTable("order_all") // not used in the query below

  // Join conditions between successive tables
  val jeCustOrd     = df_customers.col("customer_id") === df_orders.col("order_customer_id")
  val jeOrdItem     = df_orders.col("order_id") === df_order_items.col("order_item_order_id")
  val jeProdOrdItem = df_products.col("product_id") === df_order_items.col("order_item_product_id")
  val jeProdCat     = df_products.col("product_category_id") === df_categories.col("category_id")
  val jeCatDept     = df_categories.col("category_department_id") === df_departments.col("department_id")

  df_customers.where("customer_state = 'TX'")
    .join(df_orders, jeCustOrd)
    .join(df_order_items, jeOrdItem)
    .join(df_products, jeProdOrdItem)
    .join(df_categories, jeProdCat)
    .join(df_departments, jeCatDept)
    .filter("department_name = 'Fitness'")
    .select("customer_id", "customer_fname", "customer_lname", "customer_street",
            "customer_city", "customer_state", "customer_zipcode",
            "order_id", "category_name", "department_name")
    .show(5)
user3123372
-3

If they are from three different database tables, I would use pushdown filters to filter them on the server, and then join them together with the dataframe join function.

If they are not from database tables, you can use the filter and map higher-order functions to do the same in parallel.
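For illustration, a minimal sketch of the pushdown idea (the database name, table names, and filter are assumptions, not from the answer). Passing a subquery as the dbtable option lets the database apply the filter before the rows reach Spark:

// Illustrative only: "mydb", "table1" and the WHERE clause are assumed.
// The database evaluates the subquery, so only pre-filtered rows are shipped.
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/mydb")
  .option("dbtable", "(SELECT * FROM table1 WHERE val1 IS NOT NULL) t1")
  .option("user", "root").option("password", "root")
  .load()

// Read the other tables the same way, then join the dataframes on id:
// val joined = df1.join(df2, "id").join(df3, "id")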

ASe