
I am working on a project where I have around 500 columns, and I need to apply the coalesce function to every column name.

df1 schema

-id
-col1
...
-col500

df2 schema

-id
-col1
...
-col500
Dataset<Row> newDS = df1.join(df2, "id")
    .select(
        df1.col("id"),
        functions.coalesce(df1.col("col1"), df2.col("col1")).as("col1"),
        functions.coalesce(df1.col("col2"), df2.col("col2")).as("col2"),
        ...
        functions.coalesce(df1.col("col500"), df2.col("col500")).as("col500")
    )
    .show();

What I have tried

Dataset<Row> j1 = df1.join(df2, "id");
Dataset<Row> gh1 = spark.emptyDataFrame();

String[] f = df1.columns();
for (String h : f) {
    if (h == "id")
        gh1 = j1.select(df1.col("id"));
    else
        gh1 = j1.select(functions.coalesce(df1.col(h), df2.col(h)).as(h));
}

gh1.show();
Ratheri2

3 Answers


If I understand correctly, you have two dataframes with the same schema and you want to coalesce their 500 columns pairwise without having to write everything out.

This can be achieved easily by providing a sequence of columns to select. Since select does not accept a sequence of columns but rather a variable number of column arguments, you need to append `: _*` to tell Scala to treat all the elements of the sequence as separate arguments.

val cols = df1.columns.filter(_ != "id")
df1
    .join(df2, "id")
    .select(col("id") +: cols.map(n => coalesce(df1.col(n), df2.col(n)) as n) : _* )
Oli

In Java, you can pass an array of values to a method expecting a variable number of arguments, so you can rewrite your code like this:

Column[] coalescedColumns = Stream.of(df1.columns())
        .map(name -> functions.coalesce(df1.col(name), df2.col(name)).as(name))
        .toArray(Column[]::new);

Dataset<Row> newDS = df1.join(df2, "id").select(coalescedColumns);

I didn't exclude the id column, since coalesce will work as expected on that column as well.
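The array-to-varargs pass-through that makes `select(coalescedColumns)` legal is plain Java, not anything Spark-specific: a varargs parameter is an array parameter under the hood, so a ready-made array can be passed where `Column...` is expected. A minimal sketch without Spark (the `joinAll` helper and the column names here are made up for illustration):

```java
import java.util.stream.Stream;

public class VarargsDemo {
    // A varargs parameter is really an array parameter, so a prebuilt
    // array can be passed where "String..." is expected -- the same
    // mechanism that lets select(Column...) accept a Column[].
    static String joinAll(String... parts) {
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Build the array with a Stream, mirroring the answer's pattern.
        String[] built = Stream.of(new String[]{"id", "col1", "col2"})
                .map(String::toUpperCase)
                .toArray(String[]::new);
        System.out.println(joinAll(built)); // ID,COL1,COL2
    }
}
```

The same call site works with either individual arguments or one array, which is why no per-column code is needed.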

rluta

df1.columns() returns a String array, which has no stream() method of its own, so wrap it with Stream.of (or Arrays.stream) first:

Column[] coalescedColumns = Stream.of(df1.columns())
        .map(name -> functions.coalesce(df1.col(name), df2.col(name)).as(name))
        .toArray(Column[]::new);

Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id == b.id").select(coalescedColumns);
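The array/stream point can be seen in isolation with plain Java (no Spark needed): an array carries no stream() method itself, but both Arrays.stream and Stream.of accept a `String[]` directly. The column names below are stand-ins for `df1.columns()`:

```java
import java.util.Arrays;
import java.util.stream.Stream;

public class ColumnArrayDemo {
    public static void main(String[] args) {
        String[] columns = {"id", "col1", "col2"}; // stand-in for df1.columns()

        // columns.stream() would not compile: arrays have no such method.
        // Both of these wrap the array into a Stream<String>:
        long viaArrays = Arrays.stream(columns).count();
        long viaStreamOf = Stream.of(columns).count();

        System.out.println(viaArrays + " " + viaStreamOf); // 3 3
    }
}
```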

Yashwanth Kambala