Say I have the following code.
def f(x: Long): (Array[Double], Array[Int]) = {
  // Read data from a file into "data" (details omitted)
  val data: Array[Double] = ??? // 1D array
  // Generate index (based on the value of "x") into "index"
  val index: Array[Int] = ??? // index of each data element
  (data, index)
}
sc.range(0, 10, 1, 10).flatMap(x => f(x)._1 zip f(x)._2)
Questions:
1) Will the function f(x) be called twice for each x within flatMap, since I reference f(x)._1 first and then f(x)._2?
2) Will flatMap be executed in parallel (especially the data-reading part)? I have 3 nodes with 32 cores each. I set --num-executors=2 and --executor-cores=32; the third node is used as the driver node.
To answer these questions, I searched the Spark and Scala docs extensively but could not find an answer. I also ran the code on my own system. It looks like:
1) f(x) is called twice, because I found that the data partitions are processed twice. But I am not sure.
2) Two executor folders were created under the Spark log directory, and each executor produced some stdout output. But I am not sure about this either.
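To make observation 1 concrete, here is a minimal plain-Scala sketch (no Spark, run on a local List) of what I think is happening inside the lambda. The object name CallCount, the counter `calls`, and the placeholder arrays are my own stand-ins for illustration; the real f reads from a file. My assumption is that the lambda body is evaluated the same way when Spark runs it on an executor.

```scala
object CallCount {
  // Hypothetical counter: records how many times f is invoked.
  var calls = 0

  // Stand-in for the real f: returns placeholder arrays instead of reading a file.
  def f(x: Long): (Array[Double], Array[Int]) = {
    calls += 1
    val data = Array(x.toDouble, x.toDouble + 0.5) // placeholder data
    val index = Array(0, 1)                        // placeholder index
    (data, index)
  }

  def main(args: Array[String]): Unit = {
    val xs = (0L until 10L).toList

    // Same shape as the flatMap in my code: f(x) appears twice in the lambda body.
    calls = 0
    val twice: List[(Double, Int)] = xs.flatMap(x => f(x)._1 zip f(x)._2)
    println(s"f(x)._1 zip f(x)._2 -> $calls calls") // prints 20 calls for 10 inputs

    // Bind the tuple once, then zip its two halves.
    calls = 0
    val once: List[(Double, Int)] = xs.flatMap { x =>
      val (data, index) = f(x)
      data zip index
    }
    println(s"single binding     -> $calls calls") // prints 10 calls for 10 inputs
  }
}
```

If this reasoning carries over to Spark, binding the result once with `val (data, index) = f(x)` would avoid reading each file twice, but I would like confirmation.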
Thanks!