I have a DataFrame of 1000 columns, and I am trying to compute some statistics by doing some operations on each column. Since I need to sort each column, I basically can't do multi-column operations. I do all of these per-column operations in a function called processColumn:
def processColumn(df: DataFrame): Double = {
// sort the column
// get some statistics
}
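For concreteness, here is a minimal sketch of what processColumn does. The actual statistics are more involved; the exact median below is just a placeholder for a statistic that genuinely needs the sorted order, and it assumes the incoming DataFrame has a single numeric column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: sort the single column, then compute a statistic that
// depends on the sorted order (an exact median as a stand-in example).
def processColumn(df: DataFrame): Double = {
  val name   = df.columns.head
  val sorted = df.orderBy(col(name)).select(col(name).cast("double"))
  val n      = sorted.count()
  sorted.rdd.zipWithIndex()
    .filter { case (_, idx) => idx == n / 2 } // middle row of the sorted column
    .map { case (row, _) => row.getDouble(0) }
    .first()
}
```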
To get this done, I persist the DataFrame in memory and process it with Scala multi-threading (parallel collections). The code is something like this, where the initial DataFrame is df:
df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()

  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )

  parallelCol.foreach { columnName =>
    // select from the persisted newDf (not the original df),
    // otherwise the persist above is never actually used
    val result = processColumn(newDf.select(columnName))
    // result is stored to a synchronized list here
  }

  newDf.unpersist()
}
So, as you can see, I am specifying 4 threads to run at a time. But what sometimes happens is that one of the threads gets stuck, and I end up with more than 4 active jobs running. The ones that get stuck never finish.
My suspicion is that the threads started by Scala parallel collections have a timeout, so that sometimes foreach doesn't wait for all jobs to finish. The unpersist then gets called, and the still-active job is stuck forever. I have been reading the Scala collections source code to see whether parallel collection operations have a timeout, but haven't been able to figure it out for sure.
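For reference, here is a minimal plain-Scala reproduction of the threading setup (no Spark involved) that I would expect to block until all tasks finish. If foreach had a timeout, the counter could be incomplete when the loop returns; the sleep durations are arbitrary, just to simulate uneven per-column work:

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import java.util.concurrent.atomic.AtomicInteger

val done = new AtomicInteger(0)

// Same 4-thread setup as above, but over a plain range instead of columns.
val parallelCol = (1 to 8).par
parallelCol.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(4)
)

parallelCol.foreach { i =>
  Thread.sleep(200L * i) // simulate uneven per-column work
  done.incrementAndGet()
}

// foreach should only return after every element has been processed
println(done.get()) // expect 8
```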
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.