
I'm interested in finding out how Spark implements fault tolerance. The paper describes how it is done for "narrow dependencies" like map, which is fairly straightforward. However, it does not state what happens if a node crashes after a wide dependency such as a sort operation. The only thing I could find is this:

In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.

That is not really enough to understand what is happening.

After a sort, there is no way of telling where the data that was stored on the crashed node came from without storing some additional information. So if a crash happens after a sort, is the entire lineage re-executed or is there some mechanism reducing the computational overhead? And what about other wide dependencies?
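
To make the scenario concrete, here is a minimal sketch of the kind of pipeline I mean (the names and sizes are made up, local mode only); the sort introduces a shuffle, i.e. a wide dependency:

    import org.apache.spark.{SparkConf, SparkContext}

    object SortLineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sort-lineage").setMaster("local[*]"))

        val r1 = sc.parallelize(1 to 1000000, numSlices = 8) // narrow: input partitions
        val r2 = r1.map(x => (x % 100, x))                   // narrow: map
        val r3 = r2.sortByKey()                              // wide: shuffle with a range partitioner

        // toDebugString prints the lineage Spark tracks, including the shuffle
        // introduced by the sort. My question is what Spark actually re-runs
        // from this lineage when a node holding partitions of r3 crashes.
        println(r3.toDebugString)

        sc.stop()
      }
    }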


1 Answer


RDD dependencies are actually defined in terms of partitions: how each partition of an RDD is created from partitions of other RDDs.

A wide dependency means the data required to create a partition comes from more than one partition (of the same or different RDDs). Each partition is assigned to an executor.

Now assume we are joining two RDDs, R1 and R2, that have n and m partitions respectively. Also, for the sake of simplicity, assume that R1 and R2 were computed by (n + m) different executors, one per partition. We are going to create a third RDD, R3, by joining R1 and R2.

When R3 is being computed, assume a node containing x of those executors fails for some reason. This does not affect the remaining executors or their data on the other nodes.

Only those partitions of R3 that were supposed to be created from the data of those x failed executors are affected, and only those partitions are recreated.
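
To illustrate (a minimal sketch, assuming a running SparkContext sc; the RDD names and sizes are only for demonstration):

    // R1 with n = 3 partitions, R2 with m = 4 partitions, each partition handled by an executor
    val r1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")), numSlices = 3)
    val r2 = sc.parallelize(Seq((1, 10), (2, 20), (3, 30), (4, 40)), numSlices = 4)

    // Wide dependency: each partition of r3 may read shuffle output from several
    // partitions of r1 and r2.
    val r3 = r1.join(r2)

    // The per-partition lineage Spark keeps is visible here. If a partition of r3
    // is lost, only the tasks needed to rebuild that partition are re-run; shuffle
    // output that survived on other nodes is fetched rather than recomputed.
    println(r3.toDebugString)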

A more detailed visual explanation is available here.

Updated, about Spark caching: the URLs below should help you understand Spark's persistence feature as a whole.
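
In short (a minimal sketch, reusing the pair RDD r2 and the SparkContext from the example above; the storage levels are Spark's built-in ones):

    import org.apache.spark.storage.StorageLevel

    val sorted = r2.sortByKey()                        // wide dependency (shuffle)

    // Keep the sorted data around so a lost partition does not force re-running
    // the whole lineage; the "_2" variant replicates each cached partition to a
    // second node, so the cached copy survives a single node failure.
    sorted.persist(StorageLevel.MEMORY_AND_DISK)
    // sorted.persist(StorageLevel.MEMORY_AND_DISK_2)

    sorted.count()                                     // materialize the cache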

  • Coming back to the example I started with: sorting. If a partition of an RDD R2 that was created by sorting R1 is lost, is there any way to avoid sorting the **entire** RDD R1 to get the missing partition of R2? Or, from the example you linked: if a partition of G is lost, what **exactly** is recomputed here? Without storing some additional information during the groupBy and the join, I guess **everything** has to be recomputed? – Dezi May 04 '17 at 08:19
  • You are right! But Spark does keep track of the entire lineage for each partition and RDD, so it won't be necessary to re-execute all the untouched partitions. I am doubtful in the case of sorting, though. That's where Spark's caching and persistence come into the picture: to avoid unnecessary recomputation of RDDs in the case of failures. – code May 04 '17 at 08:36
  • So Spark does cache some data that could be used to recover partitions after sorting? Do you know where I could look for information on that? – Dezi May 04 '17 at 15:00
  • I have updated the answer with useful resources. Let me know if those refs were sufficient – code May 04 '17 at 15:18
  • Thank you! Spark could recover from a failure after sorting if the sorted RDD is persisted using one of the "_2" storage levels. So I guess that means that without the user taking any action, Spark would have to recompute all the RDDs that lead to the sorted RDD. – Dezi May 05 '17 at 15:51