
I have two variables: one is a DataFrame and the other is a List[DataFrame]. I want to join them. At the moment I am using the following approach:

def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {

    var joinedDf = SingleDataFrame
    DataFrameList.foreach(
      Df => {
        joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
      }
    )
    joinedDf.na.fill(0.0)
}

Is there a way to avoid the `var` and use `foldLeft` instead of `foreach`?

  • There is, but why don't you try it yourself first? `foldLeft` operates on a list and takes one dataframe to get the ball rolling and a function that produces another dataframe from 2 dataframes. You have all of those things right there. – user Aug 05 '20 at 21:07
  • Also, variable names should start with lowercase characters, since it makes it easier to distinguish them from types (and singleton `object`s) – user Aug 05 '20 at 21:08

1 Answer


You can simply write it without a `var` by using `foldLeft`:

def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = 
  dataFrameList.foldLeft(singleDataFrame)(
    (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
  ).na.fill(0.0)

In this code, `dfAcc` is the accumulated result: at each step it is joined with the next DataFrame from `dataFrameList`, and at the end you get a single DataFrame.
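To see the order in which `foldLeft` applies the joins, here is a minimal plain-Scala sketch that needs no Spark. It is only an illustration: strings stand in for DataFrames, and `FoldLeftTrace`, `describeJoin` and `trace` are hypothetical names that are not part of the code above.

```scala
// Illustration only: strings stand in for DataFrames so this runs without Spark.
object FoldLeftTrace {

  // Hypothetical stand-in for dfAcc.join(nextDF, groupByCols, "left_outer"):
  // it just records which two operands were combined.
  def describeJoin(acc: String, next: String): String =
    s"($acc JOIN $next)"

  // Same shape as joinDfList: start from `single`, then fold in each element.
  def trace(single: String, others: List[String]): String =
    others.foldLeft(single)((acc, next) => describeJoin(acc, next))

  def main(args: Array[String]): Unit = {
    println(trace("single", List("df1", "df2", "df3")))
    // prints: (((single JOIN df1) JOIN df2) JOIN df3)

    println(trace("single", Nil))
    // prints: single  (an empty list returns the starting value unchanged)
  }
}
```

The accumulator always sits on the left of each join, so with `"left_outer"` every row of the starting DataFrame is kept, and an empty list simply returns the starting DataFrame.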

Important: be careful. Chaining too many joins in one job can degrade performance.

Boris Azanov