
Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that after partition 1 has been computed, partition 2 can be computed, and after 2 is computed, partition 3 can finally be computed. The reason I need this synchronization is that each partition has a dependency on some computation from the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.

I have pondered opening a JDBC connection in each worker task, as illustrated below:

rdd.foreachPartition { partition =>
  // 1. open a JDBC connection
  // 2. poll the database for completion of the dependent partition
  // 3. read the dependent edge-case value from the computed dependent partition
  // 4. compute this partition
  // 5. write this partition's edge-case result to the database
  // 6. close the connection
}

I have even pondered using accumulators, picking the accumulator value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.

kevin
  • Spark is for distributed computing and parallel processing. If you need to process data in a sequential manner, then you do not need Spark. You can write your job in Java or Scala, execute it from the command line, or perhaps schedule it using any standard scheduler (cron, Quartz, etc.). – Sumit Dec 04 '15 at 06:49
  • The reason I want to use Spark is because I need to bring large sets of data into memory for quick computation. I would also like to exploit other properties of Spark such as data and task distribution and data resiliency. I also need to achieve a degree of parallelism but I don't want to bring that into the problem space until this seemingly simple sequential step has been resolved. – kevin Jan 05 '16 at 04:58
  • Think of two big arrays, about 20 GB, in memory. I simply need to iterate through one array while comparing values from the other. I could do this on one machine given enough memory, but let's just say that I have a cluster of machines with Spark already up and running. It's simple enough to distribute the 20 GB array across the Spark cluster given the RDD abstraction. It's equally trivial to push the code to each worker process in order to iterate through each array partition. Now, just because I need sequential execution does not mean that Spark would not be useful. – kevin Jan 05 '16 at 05:04
  • You may not be able to sequentially process the partitions in Spark. There is a `fold()` function in the RDD API, but that too works in a distributed model on each partition. You need to follow something like what @zero323 suggested, though you could use an in-memory distributed caching solution like Couchbase for storing intermediate results. – Sumit Jan 05 '16 at 09:20

1 Answer


Synchronization is not really an issue. The problem is that you want to use a concurrency layer to achieve it, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get none of the benefits of in-memory processing. In its current form it doesn't make sense to use Spark at all.

Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:

  1. Create the first RDD with data from the first partition. Process it in parallel and optionally push the results outside.
  2. Compute a differential buffer.
  3. Create the second RDD with data from the second partition. Merge it with the differential buffer from step 2, process it, and optionally push the results to the database.
  4. Go back to step 2 and repeat.

What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
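
For illustration, here is a minimal, self-contained sketch of that loop, assuming the "partitions" can be loaded as separate datasets and that the differential buffer is nothing more than the last value of the previous chunk. The names (SequentialChunks, Buffer) and the toy ranges are made up for the example, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}

object SequentialChunks {
  // Toy "differential buffer": the single edge value the next chunk depends on.
  final case class Buffer(lastValue: Long)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sequential-chunks").setMaster("local[*]"))

    // Stand-ins for the dependent partitions: separate RDDs processed one after another.
    val chunks = Seq(1L to 10L, 11L to 20L, 21L to 30L).map(r => sc.parallelize(r))

    var buffer = Buffer(lastValue = 0L)
    for (chunk <- chunks) {
      val carried = sc.broadcast(buffer)                           // ship only the small carried state
      val processed = chunk.map(x => x + carried.value.lastValue)  // steps 1/3: process this chunk in parallel
      // optionally push `processed` to a database or filesystem here
      buffer = Buffer(lastValue = processed.max())                 // step 2: differential buffer for the next round
      carried.unpersist()
    }

    sc.stop()
  }
}

Each chunk is still processed across the whole cluster; only the small buffer travels through the driver between iterations, which is what distinguishes this from the fully sequential, database-mediated approach in the question.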

zero323