I'm working in Scala with a DataFrame that has ~60 columns.
In a Databricks pipeline, we've split a few columns out, along with an identity column, to validate some data, resulting in a 'reference' DataFrame. I'd like to join it back to the main, large DataFrame and insert the validated data into the original column.
To keep things simple, I'd like the resulting DataFrame to match the schema of the original, so none of the reference columns should remain.
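For context, the reference DataFrame is built by pulling the identity column and the columns being validated out of the main DataFrame. It looks roughly like this sketch, where the column names and the upper() call are just placeholders for the real validation logic:

import org.apache.spark.sql.functions.{col, upper}

// Sketch only: select the identity column plus a column to validate,
// apply the validation (upper() stands in for the real logic),
// and keep just the identity and the validated result.
val refDF = myDF
  .select(col("Identity").as("RefIdentity"), col("Foo"))
  .withColumn("refFoo", upper(col("Foo")))
  .drop("Foo")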
On a small scale, this isn't too hard:
myDF = myDF
  .join(refDF, myDF("Identity") === refDF("RefIdentity"), "inner")
  .withColumn("Foo", $"refFoo")
  .select("Identity", "Foo", "Column2", "Column3", ...)
This turns into a huge pain when dealing with large numbers of columns. Is there a quicker way to select only the columns from myDF after the withColumn operation?
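What I'm hoping exists is a way to drive that final select from the original schema instead of typing out every column name. Something like this untested sketch is the direction I have in mind (I may be missing something about how columns resolve after the join):

import org.apache.spark.sql.functions.col

// Untested idea: capture the original ~60 column names up front and feed them
// back into select, so the result keeps exactly the original schema and order.
val originalCols = myDF.columns.map(col)

val result = myDF
  .join(refDF, myDF("Identity") === refDF("RefIdentity"), "inner")
  .withColumn("Foo", col("refFoo"))
  .select(originalCols: _*)

Does something along those lines work, or is there a cleaner way to keep only the original columns?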