
I am working on a project where I have around 500 columns, and I need to apply the coalesce function to every column name.

df1 schema

-id
-col1
...
-col500

df2 schema

-id
-col1
...
-col500
Dataset<Row> newDS = df1.join(df2, "id")
    .select(
        df1.col("id"),
        functions.coalesce(df1.col("col1"), df2.col("col1")).as("col1"),
        functions.coalesce(df1.col("col2"), df2.col("col2")).as("col2"),
        ...
        functions.coalesce(df1.col("col500"), df2.col("col500")).as("col500")
    )
    .show();

What I have tried

Dataset<Row> j1 = df1.join(df2, "id");
Dataset<Row> gh1 = spark.emptyDataFrame();

String[] f = df1.columns();
for (String h : f) {
    if (h == "id")
        gh1 = j1.select(df1.col("id"));
    else
        gh1 = j1.select(functions.coalesce(df1.col(h), df2.col(h)).as(h));
}

gh1.show();
Ratheri2

3 Answers


If I understand correctly, you have two dataframes with the same schema and you want to coalesce their 500 columns pairwise without having to write everything out.

This can be achieved easily by providing a sequence of columns to select. Since select does not accept a sequence of columns but rather a variable number of column arguments, you need to append `: _*` to tell Scala to treat all the elements of the sequence as separate arguments.

val cols = df1.columns.filter(_ != "id")
df1
    .join(df2, "id")
    .select(col("id") +: cols.map(n => coalesce(df1.col(n), df2.col(n)) as n) : _* )
Oli

In Java, you can pass an array of values to a method expecting a variable number of arguments, so you can rewrite your code like this:

Column[] coalescedColumns = Stream.of(df1.columns())
        .map(name -> functions.coalesce(df1.col(name), df2.col(name)).as(name))
        .toArray(Column[]::new);

Dataset<Row> newDS = df1.join(df2, "id").select(coalescedColumns);

I didn't exclude the id column, since coalesce will work as expected on that column as well.
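The array-to-varargs pass-through that makes `select(coalescedColumns)` legal is plain Java, not anything Spark-specific: a varargs parameter is an array parameter under the hood, so a ready-made array can be passed where `Column...` is expected. A minimal sketch without Spark (the `joinAll` helper and the column names here are made up for illustration):

```java
import java.util.stream.Stream;

public class VarargsDemo {
    // A varargs parameter is really an array parameter, so a prebuilt
    // array can be passed where "String..." is expected -- the same
    // mechanism that lets select(Column...) accept a Column[].
    static String joinAll(String... parts) {
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Build the array with a Stream, mirroring the answer's pattern.
        String[] built = Stream.of(new String[]{"id", "col1", "col2"})
                .map(String::toUpperCase)
                .toArray(String[]::new);
        System.out.println(joinAll(built)); // ID,COL1,COL2
    }
}
```

The same call site works with either individual arguments or one array, which is why no per-column code is needed.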

rluta

df1.columns() returns a String array, which has no stream() method of its own, so wrap it with Stream.of (or Arrays.stream) first:

Column[] coalescedColumns = Stream.of(df1.columns())
        .map(name -> functions.coalesce(df1.col(name), df2.col(name)).as(name))
        .toArray(Column[]::new);

Dataset<Row> newDS = df1.as("a").join(df2.as("b")).where("a.id == b.id").select(coalescedColumns);
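The array/stream point can be seen in isolation with plain Java (no Spark needed): an array carries no stream() method itself, but both Arrays.stream and Stream.of accept a `String[]` directly. The column names below are stand-ins for `df1.columns()`:

```java
import java.util.Arrays;
import java.util.stream.Stream;

public class ColumnArrayDemo {
    public static void main(String[] args) {
        String[] columns = {"id", "col1", "col2"}; // stand-in for df1.columns()

        // columns.stream() would not compile: arrays have no such method.
        // Both of these wrap the array into a Stream<String>:
        long viaArrays = Arrays.stream(columns).count();
        long viaStreamOf = Stream.of(columns).count();

        System.out.println(viaArrays + " " + viaStreamOf); // 3 3
    }
}
```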

Yashwanth Kambala