I have a DataFrame of 1000 columns, and I am trying to compute some statistics by doing some operations on each column. Since I need to sort each column, I basically can't do multi-column operations. I do all of these per-column operations in a function called processColumn:
def processColumn(df: DataFrame): Double = {
// sort the column
// get some statistics
}
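For concreteness, here is a minimal sketch of what processColumn does. The actual statistics are more involved; the exact median below is just a placeholder for a statistic that genuinely needs the sorted order, and it assumes the incoming DataFrame has a single numeric column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: sort the single column, then compute a statistic that
// depends on the sorted order (an exact median as a stand-in example).
def processColumn(df: DataFrame): Double = {
  val name   = df.columns.head
  val sorted = df.orderBy(col(name)).select(col(name).cast("double"))
  val n      = sorted.count()
  sorted.rdd.zipWithIndex()
    .filter { case (_, idx) => idx == n / 2 } // middle row of the sorted column
    .map { case (row, _) => row.getDouble(0) }
    .first()
}
```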
To get this done, I persist the DataFrame in memory and process it with Scala multi-threading (parallel collections). The code is something like this, where the initial DataFrame is df:
df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )

  parallelCol.foreach { columnName =>
    // select from the persisted newDf (not the original df),
    // otherwise the persist above is never actually used
    val result = processColumn(newDf.select(columnName))
    // result is stored to a synchronized list here
  }

  newDf.unpersist()
}
So, as you can see, I am specifying 4 threads to run at a time. But what sometimes happens is that one of the threads gets stuck, and I end up with more than 4 active jobs running. The ones that get stuck never finish.
My suspicion is that the threads started by Scala parallel collections have a timeout, so that sometimes foreach doesn't wait for all jobs to finish. The unpersist then gets called, and the still-active job is stuck forever. I have been reading the Scala collections source code to see whether parallel collection operations have a timeout, but haven't been able to figure it out for sure.
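For reference, here is a minimal plain-Scala reproduction of the threading setup (no Spark involved) that I would expect to block until all tasks finish. If foreach had a timeout, the counter could be incomplete when the loop returns; the sleep durations are arbitrary, just to simulate uneven per-column work:

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import java.util.concurrent.atomic.AtomicInteger

val done = new AtomicInteger(0)

// Same 4-thread setup as above, but over a plain range instead of columns.
val parallelCol = (1 to 8).par
parallelCol.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(4)
)

parallelCol.foreach { i =>
  Thread.sleep(200L * i) // simulate uneven per-column work
  done.incrementAndGet()
}

// foreach should only return after every element has been processed
println(done.get()) // expect 8
```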
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.