How to classify a failure detector?

Question

I understand that failure detectors in asynchronous systems are classified as (eventually) perfect or (eventually) strong, and I know how those classes are defined, but I struggle to get the intuition behind the classification.

Suppose I have a concrete implementation of a failure detector, which periodically listens for heartbeat messages from each process. If a process hasn't sent its heartbeat message for a while, the process will be added to a list of suspects until a message is received from the process.
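For concreteness, a detector of this kind might look like the following minimal sketch. The class name, the method names, the `timeout` parameter and the injectable `clock` are illustrative assumptions, not taken from any particular library.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout` seconds.

    Hypothetical sketch: names and parameters are assumptions for illustration.
    """

    def __init__(self, processes, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        now = clock()
        # Treat start-up as the moment every process was last heard from.
        self.last_heard = {p: now for p in processes}

    def on_heartbeat(self, process):
        """Record a heartbeat; a suspected process stops being suspected."""
        self.last_heard[process] = self.clock()

    def suspects(self):
        """Return the processes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}
```

Passing a fake `clock` (any zero-argument function returning a number) makes the behaviour easy to check without real waiting.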

Now, how do I know which class this implementation belongs to? Would that require a formal proof of the FD's completeness and accuracy properties? If a perfect FD can be implemented, why bother studying other (weaker) ones? Or are the classes only "assumed" when designing fault-tolerant distributed algorithms?

I am a bit puzzled by how to actually classify a given (concrete) FD. I would appreciate any answers.

danyhow · Accepted Answer · 2015-03-16T09:43:54.457


You first need to model the synchrony of the processes and of the links between them; for example: "all processes can eventually communicate in a timely manner, messages are transmitted within a known time bound, and processes execute their steps within a known time bound". Once you have defined such a model, you can analyze a specific algorithm, determine which class it belongs to, and prove it.
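To see why the model matters, here is a small sketch of my own (not taken from any paper). It counts how often a timeout-based detector wrongly suspects a correct process, given the times at which that process's heartbeats arrive. The function name, the `step` parameter and the example numbers are assumptions for illustration. A simulation like this only illustrates the argument; it does not replace the proof.

```python
from itertools import accumulate


def false_suspicions(arrival_times, timeout, adaptive=False, step=1.0):
    """Count wrong suspicions of one correct process.

    arrival_times: increasing times at which its heartbeats arrive.
    A gap between consecutive arrivals larger than the current timeout means
    the detector suspected the process before the next heartbeat came in.
    With adaptive=True the timeout grows by `step` after each mistake.
    Returns (number_of_mistakes, final_timeout).
    """
    mistakes = 0
    last = 0.0
    for t in arrival_times:
        if t - last > timeout:
            mistakes += 1
            if adaptive:
                timeout += step
        last = t
    return mistakes, timeout


# Model A: heartbeats every 1.0s, delays never exceed a KNOWN bound of 0.5s.
# Choosing timeout = period + bound = 1.5 rules out false suspicions.
delays = [0.1, 0.5, 0.0, 0.3, 0.5, 0.2]
sync_arrivals = [(k + 1) * 1.0 + d for k, d in enumerate(delays)]
print(false_suspicions(sync_arrivals, timeout=1.5))

# Model B: a bound exists but is UNKNOWN; gaps of 2.7s keep occurring.
late_arrivals = list(accumulate([1.0, 2.7] * 5))
print(false_suspicions(late_arrivals, timeout=1.5))                 # keeps erring
print(false_suspicions(late_arrivals, timeout=1.5, adaptive=True))  # errs finitely often
```

Under model A the fixed timeout never suspects a correct process, which is the accuracy half of a perfect FD. Under model B the same fixed timeout keeps making mistakes, while the adaptive variant stops after finitely many, which is the behaviour an eventually perfect FD requires.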

The different classes of failure detectors are useful for encapsulating and abstracting away such underlying assumptions when designing higher-level algorithms. They can also be used to determine which problems (consensus, broadcast, weak leader election, etc.) are harder or easier to solve, depending on the failure detector class they require.
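As a rough illustration of that abstraction, a higher-level algorithm can be written against an interface that only exposes the current suspect set, without any timing assumptions in its own code. The `FailureDetector` protocol, `pick_leader` and `FixedSuspects` below are hypothetical names I made up for this sketch.

```python
from typing import Optional, Protocol, Sequence, Set


class FailureDetector(Protocol):
    """Anything that can report the processes it currently suspects."""

    def suspects(self) -> Set[str]: ...


def pick_leader(processes: Sequence[str], fd: FailureDetector) -> Optional[str]:
    """Return the first process (in a fixed, agreed order) that is not suspected."""
    suspected = fd.suspects()
    for p in processes:
        if p not in suspected:
            return p
    return None


class FixedSuspects:
    """Stub detector used only to exercise pick_leader."""

    def __init__(self, suspected: Set[str]):
        self._suspected = suspected

    def suspects(self) -> Set[str]:
        return set(self._suspected)


print(pick_leader(["p1", "p2", "p3"], FixedSuspects({"p1"})))
```

The correctness argument for `pick_leader` then refers only to the class: if the detector is eventually perfect, every correct process eventually sees exactly the crashed processes as suspects, so all of them settle on the same leader.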

Contrary to what your question assumes, a perfect FD cannot be implemented in just any system model; it requires strong synchrony assumptions. Actually, one active area of research is finding the minimal synchrony requirements under which, e.g., an omega failure detector can be implemented (see the "Omega meets Paxos" paper).

You can imagine diverse scenarios where synchrony is only partial, e.g., some links are too unreliable, some processes are behind firewalls (outgoing messages allowed, but no incoming messages), etc. When you model the synchrony of a concrete deployment and then answer the question of which FD can be built on such a model, you are at the same time answering which problems can be solved in that model (and consequently in that deployment).

edited Mar 16 '15 at 09:43

answered Mar 16 '15 at 09:34


Right, so just to clarify, the actual class depends on the model of the network? Also, do people then design algorithms for a particular FD class rather than for a particular network model? I.e., rather than saying "I've devised an algorithm for solving problem P on a network with all these assumptions A..." they would say "I've devised an algorithm for solving P using a failure detector of class C" (where C can be implemented on a model with assumptions A)? Hope this is clear. Thanks for your answer btw! – Andy Scott Mar 19 '15 at 13:46
Just note that "the actual class depends on the model of the network" should read "the actual class depends on the system model", where system model refers to the network *and* the processes (because if the network is perfectly synchronous but processes are never timely, it might be impossible to build some failure detector classes). – danyhow Mar 19 '15 at 14:55
Oh ok. Could you also please clarify if those statements I made in my previous comment are true? – Andy Scott Mar 21 '15 at 21:04
Yes. When people use the failure detector abstraction, they design their algorithm assuming they have access to a failure detector of a given class. Perhaps [this survey](http://infoscience.epfl.ch/record/138592/files/fdsurvey.pdf) helps you further. – danyhow Mar 26 '15 at 14:11
