
I have a DataFrame with 1000 columns, and I am trying to compute some statistics by running operations on each column. Since I need to sort each column individually, I basically can't do multi-column operations. All of these per-column operations happen in a function called processColumn:

def processColumn(df: DataFrame): Double = {

  // sort the column
  // get some statistics
}
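
For concreteness, a simplified sketch of the kind of thing that function does is below. The real statistic is more involved; the numeric column type and the collect() to the driver are simplifications for illustration only:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def processColumn(df: DataFrame): Double = {
  val name = df.columns.head
  // sort the single column (assumed numeric and non-null in this sketch)
  val sortedValues = df.sort(col(name)).collect().map(_.getDouble(0))
  // placeholder statistic: the median of the sorted values
  sortedValues(sortedValues.length / 2)
}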

To do this, I persist the DataFrame in memory and process it with Scala multi-threading (parallel collections). The code looks roughly like this.

Let's say the initial DataFrame is df:

import scala.collection.parallel.ForkJoinTaskSupport

df.columns.grouped(100).foreach { columnGroups =>

  // cache this 100-column slice so the per-column jobs read it from memory
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )

  parallelCol.foreach { columnName =>
    val result = processColumn(newDf.select(columnName))
    // I store the result in a synchronized list here
  }

  newDf.unpersist()
}
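
For completeness, the "synchronized list" mentioned in the comment is just a thread-safe buffer shared across the worker threads, roughly like this (names are illustrative):

import scala.collection.mutable.ArrayBuffer

// shared across the 4 worker threads
val results = new ArrayBuffer[(String, Double)]()

def storeResult(columnName: String, value: Double): Unit =
  results.synchronized {
    results.append((columnName, value))
  }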

As you can see, I am limiting the pool to 4 threads at a time. But sometimes one of the threads gets stuck, I end up with more than 4 active Spark jobs, and the jobs that get stuck never finish.

My suspicion is that the threads started by Scala parallel collections have some kind of timeout, so that foreach sometimes returns before all jobs have finished. unpersist then gets called, and the still-active job is stuck forever. I have been reading the Scala collections source code to check whether parallel operations have a timeout, but I haven't been able to confirm it either way.
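
A minimal, Spark-free experiment for that hypothesis (does a parallel foreach return before every element has been processed?) would be something like the following; this is only a sketch of the test, not code from the actual job:

import scala.collection.parallel.ForkJoinTaskSupport

object ParForeachTest extends App {
  val items = (1 to 16).par
  items.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )
  items.foreach { i =>
    Thread.sleep(1000)  // simulate a slow per-column job
    println(s"finished $i on ${Thread.currentThread().getName}")
  }
  println("foreach returned")  // should only print after all 16 finish
}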

Any help would be highly appreciated. Please let me know if you have any questions. Thank you.

Debasish