I am trying to translate a PySpark job that dynamically coalesces the columns from two datasets, with additional filters/conditions.
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
Can I do this in Scala?
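For reference, here is a minimal sketch of how that list comprehension could be translated to Scala, assuming `df1` and `df2` are the two DataFrames from the job and a running `SparkSession` is available:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

// Columns to exclude, as in the PySpark version
val excluded = Set("firstname", "middlename", "lastname")

// Equivalent of the Python list comprehension: one conditional Column per
// remaining column, emitting the column name when the values differ
val conditions: Seq[Column] =
  df1.columns.toSeq
    .filterNot(excluded.contains)
    .map(c => when(df1(c) =!= df2(c), lit(c)).otherwise(""))
```

Note that Scala uses `=!=` for the column inequality operator where PySpark uses `!=`.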
What I have tried so far is:
df1.join(df2, Seq("col1"), "outer").select(col("col1"), coalesce(df1.col("col2"), df2.col("col2")).as("col2"), coalesce(df1.col("col3"), df2.col("col3")).as("col3"), ..., coalesce(df1.col("col30"), df2.col("col30")).as("col30"))
Is there a better way to build these columns in a loop instead of writing each one out by hand?
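One way to avoid spelling out all 30 coalesces is to build the select list programmatically. A sketch, assuming `col1` is the only join key and every other column of `df1` also exists in `df2`:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, coalesce}

// Join key(s) plus the remaining columns to coalesce (col2 .. col30 here)
val keyCols   = Seq("col1")
val otherCols = df1.columns.toSeq.filterNot(keyCols.contains)

// Build the full select list in one pass instead of writing each coalesce
val selectExprs: Seq[Column] =
  keyCols.map(col) ++
    otherCols.map(c => coalesce(df1(c), df2(c)).as(c))

// Expand the Seq into the varargs that select expects
val result = df1.join(df2, keyCols, "outer").select(selectExprs: _*)
```

The `: _*` ascription expands the `Seq[Column]` into the varargs signature of `select`, which is the idiomatic Scala replacement for Python's `*` unpacking.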