
Say I have the following code.

def f(x: Long): (Array[Double], Array[Int]) = {
    val data: Array[Double] = ???    // 1D array
    val index: Array[Int] = ???      // each data element's index

    // Read data from a file into "data"
    // Generate index (based on the value "x") into "index"

    (data, index)
}

sc.range(0, 10, 1, 10).flatMap(x => f(x)._1 zip f(x)._2)

Questions:

1) Will the function f(x) be called twice for each x within flatMap, since I call f(x)._1 first and then f(x)._2?

2) Will flatMap be executed (especially the data-reading part) in parallel? Say I have 3 nodes and each node has 32 cores. I set --num-executors=2 and --executor-cores=32. The third node is used as the driver node.

To answer the above questions, I searched the Spark/Scala docs a lot but didn't find any answers there. I tried running the code on my own system. It looks like:

1) f(x) is called twice, because I found the data partitions are processed twice. But I am not sure.

2) I noticed two executor folders are created under the Spark log file system, along with some stdout from each executor. But I am not sure either.

Thanks!

Bin Dong

1 Answer


1) Yes, every worker will execute f(x) twice for each element, since it is invoked two times in your function literal, each time extracting a different element of the resulting tuple.
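If avoiding the double call matters (for example, because f reads a file), a minimal sketch, assuming the same f as in the question, is to bind the result once inside the function literal:

    // Hypothetical rewrite: call f(x) once per x and destructure the tuple,
    // so the work inside f happens only one time per element.
    sc.range(0, 10, 1, 10).flatMap { x =>
        val (data, index) = f(x)   // single invocation of f
        data zip index             // Array[(Double, Int)], flattened by flatMap
    }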

2) The last parameter of your range method is 10, which means that your range RDD will have 10 partitions. Each partition is processed by a single task, so the upper bound on the parallelism of that flatMap is 10 (if you had 10 executors, the flatMap could run in parallel on every one of them). Since you have two executors, the flatMap (including the data-reading part inside f) will still be executed in parallel, with the 10 tasks spread across those two executors.
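As a rough illustration (variable names here are mine, not from the question), you can verify the partition count and reason about the available task slots like this:

    val rdd = sc.range(0, 10, 1, 10)   // 4th argument = numSlices = 10
    println(rdd.getNumPartitions)      // prints 10: one task per partition
    // With --num-executors=2 and --executor-cores=32 there are 64 task slots,
    // so all 10 tasks (including the data reading inside f) can run
    // concurrently, spread across the two executor nodes.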

Paweł Jurczenko