
I have two large Spark DataFrames, df1 and df2. df1 has a column named colName that has only one distinct value. I need to add this column to df2. I'm wondering what would be the most efficient way to do that?

My idea is to use limit() or first() on df1 and then crossJoin with df2, but first() would trigger an action.

val newDf = df1.select(colName).limit(1)
df2.crossJoin(newDf)
  • You could collect the first row of `df1` and extract the value, then add it to `df2` without any join. – Gaël J Aug 29 '23 at 04:56

1 Answer


Below are some ways to increase performance.

  1. As @Gaël J mentioned, you can extract the first row's value and create the new column with it.
  2. You can broadcast the smaller DataFrame before joining; this reduces the amount of data shuffled during the join.
  3. Reduce the number of partitions of df2 according to your executor memory (a short sketch appears at the end of this answer).

Below are sample df1 and df2.
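A minimal sketch of comparable sample data, assuming a spark-shell session where spark.implicits._ is available (the schemas and values here are hypothetical; only the single-valued colName in df1 matters):

import spark.implicits._

// df1: colName holds a single distinct value (hypothetical sample data)
val df1 = Seq((1, "a"), (1, "b"), (1, "c")).toDF("colName", "other")
// df2: the DataFrame that should receive the column (hypothetical sample data)
val df2 = Seq(("x", 10), ("y", 20)).toDF("id", "amount")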

Getting the first row's value and creating newDf:

import org.apache.spark.sql.functions._

// first() fetches a single row; extract the colName value from it
val value = df1.select("colName").first()(0)
// Add the value to df2 as a literal column; no join or shuffle needed
val newDf = df2.withColumn("colName", lit(value).cast("INT"))
newDf.show()
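Note that first() does trigger a job, but it only has to produce a single row from df1, so it is much cheaper than running a join over the full data.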


Using broadcast:

import org.apache.spark.sql.functions._

// One-row DataFrame holding the single distinct value
val newDf = df1.select("colName").limit(1)
// Hint Spark to broadcast the tiny side so df2 is not shuffled
val broadcastNewDf = broadcast(newDf)
val resultDf = df2.crossJoin(broadcastNewDf)
resultDf.show()
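Because the broadcast side contains a single row, Spark plans this as a broadcast nested loop join, so df2 stays in place and nothing large is shuffled.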

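For the third point, a minimal sketch (the target of 50 partitions is an assumption; tune it to your data volume and executor memory):

// coalesce() lowers the partition count without a full shuffle;
// repartition() would shuffle, but can rebalance skewed partitions
val df2Fewer = df2.coalesce(50)
println(df2Fewer.rdd.getNumPartitions)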

JayashankarGS