13.4. Fault-tolerant consensus

The algorithms presented so far are based on the assumption that the system on which they run is reliable. Here we present selected algorithms for unreliable distributed systems, where the active (or correct) processors need to coordinate their activities based on common decisions.

It is inherently difficult for processors to reach agreement in a distributed setting prone to failures. Consider the deceptively simple problem of two failure-free processors attempting to agree on a common bit using a communication medium where messages may be lost. This problem is known as the two generals problem. Here two generals must coordinate an attack using couriers that may be destroyed by the enemy. It turns out that it is not possible to solve this problem using a finite number of messages. We prove this fact by contradiction. Assume that there is a protocol, used by processors A and B, involving a finite number of messages. Let us consider such a protocol that uses the smallest number of messages, say k messages. Assume without loss of generality that the last (k-th) message is sent from A to B. Since this final message is not acknowledged by B, A must determine the decision value whether or not B receives this message. Since the message may be lost, B must determine the decision value without receiving this final message. But now both A and B decide on a common value without needing the k-th message. In other words, there is a protocol that uses only k−1 messages for the problem. But this contradicts the assumption that k is the smallest number of messages needed to solve the problem.

In the rest of this section we consider agreement problems where the communication medium is reliable, but where the processors are subject to two types of failures: crash failures, where a processor stops and does not perform any further actions, and Byzantine failures, where a processor may exhibit arbitrary, or even malicious, behaviour as the result of the failure. The algorithms presented deal with the so called consensus problem, first introduced by Lamport, Pease, and Shostak. The consensus problem is a fundamental coordination problem that requires processors to agree on a common output, based on their possibly conflicting inputs.

13.4.1. The consensus problem

We consider a system in which each processor p_i has a special state component x_i, called the input, and y_i, called the output (also called the decision). The variable x_i initially holds a value from some well ordered set of possible inputs, and y_i is undefined. Once an assignment to y_i has been made, it is irreversible. Any solution to the consensus problem must guarantee:

  • Termination: In every admissible execution, y_i is eventually assigned a value, for every nonfaulty processor p_i.

  • Agreement: In every execution, if y_i and y_j are assigned, then y_i = y_j, for all nonfaulty processors p_i and p_j. That is, nonfaulty processors do not decide on conflicting values.

  • Validity: In every execution, if x_i = v for some value v, for all processors p_i, and if y_i is assigned for some nonfaulty processor p_i, then y_i = v. That is, if all processors have the same input value, then any value decided upon must be that common input.

Note that in the case of crash failures this validity condition is equivalent to requiring that every nonfaulty decision value is the input of some processor. Once a processor crashes it is of no interest to the algorithm, and no requirements are put on its decision.
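The three conditions can be stated as simple predicates over one finished execution. The following Python sketch (the function and argument names are ours, purely illustrative) checks them for a given execution, using None for an output that was never assigned:

```python
def check_consensus(inputs, outputs, nonfaulty):
    """Check termination, agreement and validity for one finished
    execution.  inputs[i] is processor i's input x_i, outputs[i] its
    decision y_i (None if never assigned); nonfaulty is a set of ids."""
    # Termination: every nonfaulty processor eventually decided.
    termination = all(outputs[i] is not None for i in nonfaulty)
    # Agreement: nonfaulty processors do not decide on conflicting values.
    decided = {outputs[i] for i in nonfaulty if outputs[i] is not None}
    agreement = len(decided) <= 1
    # Validity: if all inputs equal some v, every nonfaulty decision is v.
    validity = len(set(inputs)) > 1 or decided <= {inputs[0]}
    return termination and agreement and validity

# Processor 2 crashed before deciding; the survivors agree on the
# common input, so all three conditions hold.
print(check_consensus([0, 0, 0], [0, 0, None], nonfaulty={0, 1}))  # True
```

Note that the crashed processor's missing decision does not violate termination, matching the remark above that no requirements are put on faulty processors.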

We begin by presenting a simple algorithm for consensus in a synchronous message passing system with crash failures.

13.4.2. Consensus with crash failures

Since the system is synchronous, an execution of the system consists of a series of rounds. Each round consists of the delivery of all messages, followed by one computation event for every processor. The set of faulty processors can be different in different executions, that is, it is not known in advance. Let F be a subset of at most f processors, the faulty processors, where f is an upper bound on the number of failures. Each round contains exactly one computation event for the processors not in F and at most one computation event for every processor in F. Moreover, if a processor in F does not have a computation event in some round, it does not have such an event in any further round. In the last round in which a faulty processor has a computation event, an arbitrary subset of its outgoing messages are delivered.


       Consensus-Crash

       Code for processor p_i, 0 ≤ i ≤ n−1.
       Initially V = {x}

       round k, 1 ≤ k ≤ f+1
         1  send {v ∈ V : p_i has not already sent v} to all processors
         2  receive S_j from p_j, 0 ≤ j ≤ n−1, j ≠ i
         3  V := V ∪ S_0 ∪ … ∪ S_{n−1}
         4  if k = f+1 then y := min(V)

In the previous algorithm, which is based on an algorithm by Dolev and Strong, each processor maintains a set V of the values it knows to exist in the system. Initially, the set contains only its own input. In later rounds the processor updates its set by joining it with the sets received from other processors. It then broadcasts any new additions to the set to all processors. This continues for f+1 rounds, where f is the maximum number of processors that can fail. At this point, the processor decides on the smallest value in its set of values.
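The flooding behaviour of the algorithm is easy to simulate. The following Python sketch (our own encoding; a processor crashing in round k delivers an arbitrary subset of that round's messages and is silent afterwards) mirrors the pseudocode above:

```python
import random

def crash_consensus(inputs, crash_round, f):
    """Simulate the (f+1)-round consensus algorithm under crash failures.
    crash_round maps a processor id to the round in which it crashes;
    in that round an arbitrary subset of its messages is delivered.
    Returns the decisions of the processors that never crash."""
    n = len(inputs)
    V = [{v} for v in inputs]            # values known to each processor
    sent = [set() for _ in range(n)]     # values each processor has broadcast
    for k in range(1, f + 2):            # rounds 1 .. f+1
        inbox = [set() for _ in range(n)]
        for i in range(n):
            if crash_round.get(i, f + 2) < k:
                continue                 # crashed in an earlier round
            new = V[i] - sent[i]
            sent[i] |= new
            recipients = list(range(n))
            if crash_round.get(i) == k:  # crashing now: partial delivery
                recipients = random.sample(recipients, random.randint(0, n))
            for j in recipients:
                inbox[j] |= new
        for j in range(n):
            V[j] |= inbox[j]
    return {i: min(V[i]) for i in range(n) if i not in crash_round}

print(crash_consensus([3, 1, 2, 1], crash_round={1: 1}, f=1))
# → {0: 1, 2: 1, 3: 1}: the survivors agree on the smallest known value
```

With f = 1 the simulation runs f+1 = 2 rounds; here the three surviving processors always decide on the smallest value any of them holds, whatever subset of messages the crashing processor manages to deliver.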

To prove the correctness of this algorithm we first notice that the algorithm requires exactly f+1 rounds. This implies termination. Moreover, the validity condition is clearly satisfied, since the decision value is the input of some processor. It remains to show that the agreement condition holds. We prove the following lemma:

Lemma 13.12 In every execution, at the end of round f+1, V_i = V_j for every two nonfaulty processors p_i and p_j.

Proof. We prove the claim by showing that if x ∈ V_i at the end of round f+1, then x ∈ V_j at the end of round f+1, for all nonfaulty processors p_i and p_j.

Let r be the first round in which x is added to V_i for any nonfaulty processor p_i. If x is initially in V_i, let r = 0. If r ≤ f then, in round r+1 ≤ f+1, p_i sends x to each p_j, causing p_j to add x to V_j, if it is not already present.

Otherwise, suppose r = f+1 and let p_j be a nonfaulty processor that receives x for the first time in round f+1. Then there must be a chain of f+1 processors p_{i_1}, …, p_{i_{f+1}} that transfers the value x to p_j: p_{i_1} sends x to p_{i_2} in round one, and so on, until p_{i_{f+1}} sends x to p_j in round f+1. But p_{i_1}, …, p_{i_{f+1}} is a chain of f+1 processors, while at most f processors are faulty. Hence at least one of them, say p_{i_k}, must be nonfaulty. But then p_{i_k} adds x to its set in round k−1 < r, contradicting the minimality of r.

This lemma, together with the observations above, implies the following theorem.

Theorem 13.13 The previous consensus algorithm solves the consensus problem in the presence of f crash failures in a message passing system within f+1 rounds.

The following theorem was first proved by Fischer and Lynch for Byzantine failures; Dolev and Strong later extended it to crash failures. The theorem shows that the previous algorithm is optimal in the given model.

Theorem 13.14 There is no algorithm which solves the consensus problem in fewer than f+1 rounds in the presence of f crash failures, if n ≥ f+2.

What if failures are not benign? That is, can the consensus problem be solved in the presence of Byzantine failures? And if so, how?

13.4.3. Consensus with Byzantine failures

In a computation step of a faulty processor in the Byzantine model, the new state of the processor and the message sent are completely unconstrained. As in the reliable case, every processor takes a computation step in every round and every message sent is delivered in that round. Hence a faulty processor can behave arbitrarily and even maliciously. For example, it could send different messages to different processors. It can even appear that the faulty processors coordinate with each other. A faulty processor can also mimic the behaviour of a crashed processor by failing to send any messages from some point on.

In this case, the definition of the consensus problem is the same as in the message passing model with crash failures. The validity condition in this model, however, is not equivalent to requiring that every nonfaulty decision value is the input of some processor. As in the crash case, no conditions are put on the output of faulty processors.

13.4.4. Lower bound on the ratio of faulty processors

Pease, Shostak and Lamport first proved the following theorem.

Theorem 13.15 In a system with n processors and f Byzantine processors, there is no algorithm which solves the consensus problem if n ≤ 3f.

13.4.5. A polynomial algorithm

The following algorithm uses messages of constant size, takes 2(f+1) rounds, and assumes that n > 4f. It was presented by Berman and Garay.

This consensus algorithm for Byzantine failures contains f+1 phases, each taking two rounds. Each processor has a preferred decision for each phase, initially its input value. In the first round of each phase, processors send their preferences to each other. Let maj be the majority value in the set of values received by processor p_i at the end of the first round of phase k. If no majority exists, a default value v⊥ is used. In the second round of the phase, processor p_k, called the king of the phase, sends its majority value to all processors. If p_i receives more than n/2 + f copies of maj (in the first round of the phase) then it sets its preference for the next phase to be maj; otherwise it sets its preference to the phase king's preference, received in the second round of the phase. After f+1 phases, the processor decides on its preference. Each processor maintains a local array pref with n entries.

We prove correctness using the following lemmas. Termination is immediate, since every processor decides at the end of phase f+1. We next note the persistence of agreement:

Lemma 13.16 If all nonfaulty processors prefer v at the beginning of phase k, then they all prefer v at the end of phase k, for all k, 1 ≤ k ≤ f+1.

Proof. Since all nonfaulty processors prefer v at the beginning of phase k, they all receive at least n−f copies of v (including their own) in the first round of phase k. Since n > 4f, we have n/2 > 2f and hence n−f > n/2 + f, implying that all nonfaulty processors will prefer v at the end of phase k.


       Code for processor p_i, 0 ≤ i ≤ n−1.
       Initially pref[j] = v⊥, for any j ≠ i, and pref[i] = x

       round 2k−1, 1 ≤ k ≤ f+1
         1  send ⟨pref[i]⟩ to all processors
         2  receive ⟨v_j⟩ from p_j and assign to pref[j], for all 0 ≤ j ≤ n−1, j ≠ i
         3  let maj be the majority value of pref[0], …, pref[n−1] (v⊥ if none)
         4  let mult be the multiplicity of maj

       round 2k, 1 ≤ k ≤ f+1
         5  if i = k then send ⟨maj⟩ to all processors
         6  receive ⟨king-maj⟩ from p_k (v⊥ if none)
         7  if mult > n/2 + f then pref[i] := maj else pref[i] := king-maj
         8  if k = f+1 then y := pref[i]
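The two-round phase structure can be made concrete with a small simulation. In this Python sketch (ours; the toy adversary makes every Byzantine processor send the constant 1, one of many possible behaviours), maj falls back to a default value 0 when no strict majority exists:

```python
from collections import Counter

DEFAULT = 0   # stands in for the default value used when no majority exists

def phase_king(inputs, byzantine, f):
    """Simulate the Berman-Garay phase-king algorithm (requires n > 4f).
    byzantine is a set of faulty ids; this toy adversary makes every
    faulty processor send the constant 1 in both rounds."""
    n = len(inputs)
    assert n > 4 * f
    pref = list(inputs)
    for k in range(f + 1):                    # phase k, king is processor k
        # First round: exchange preferences, compute majority value.
        maj, mult = [DEFAULT] * n, [0] * n
        for i in range(n):
            received = [1 if j in byzantine else pref[j] for j in range(n)]
            v, c = Counter(received).most_common(1)[0]
            if c > n / 2:                     # strict majority required
                maj[i], mult[i] = v, c
        # Second round: the king broadcasts its majority value.
        king_maj = 1 if k in byzantine else maj[k]
        for i in range(n):
            if i not in byzantine:
                pref[i] = maj[i] if mult[i] > n / 2 + f else king_maj
    return {i: pref[i] for i in range(n) if i not in byzantine}
```

Running it with n = 9, f = 2 and mixed inputs, all nonfaulty processors finish with a common preference, as the lemmas below guarantee once a nonfaulty king has reigned.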

This implies the validity condition: if all nonfaulty processors start with the same input v, they will continue to prefer v and finally decide on v in phase f+1. Agreement is achieved by the king breaking ties. Since each phase has a different king and there are f+1 phases, at least one phase has a nonfaulty king.

Lemma 13.17 Let g be a phase whose king p_g is nonfaulty. Then all nonfaulty processors finish phase g with the same preference.

Proof. Suppose all nonfaulty processors use the majority value received from the king for their preference. Since the king is nonfaulty, it sends the same message to all processors, and hence all the nonfaulty preferences are the same.

Suppose some nonfaulty processor p_i uses its own majority value v for its preference. Thus p_i receives more than n/2 + f messages for v in the first round of phase g. At most f of these messages come from faulty processors, so more than n/2 nonfaulty processors sent v. Hence every processor, including p_g, receives more than n/2 messages for v in the first round of phase g and sets its majority value to v. Hence every nonfaulty processor has v for its preference.

Hence, at the beginning of phase g+1, all nonfaulty processors have the same preference, and by Lemma 13.16 they will decide on the same value at the end of the algorithm. Hence the algorithm has the agreement property and solves consensus.

Theorem 13.18 There exists an algorithm for n processors which solves the consensus problem in the presence of f Byzantine failures within 2(f+1) rounds using constant size messages, if n > 4f.

13.4.6. Impossibility in asynchronous systems

As shown before, the consensus problem can be solved in synchronous systems in the presence of both crash (benign) and Byzantine (severe) failures. What about asynchronous systems? Under the assumption that the communication system is completely reliable, and the only possible failures are caused by unreliable processors, it can be shown that if the system is completely asynchronous then there is no consensus algorithm even in the presence of only a single processor failure. The result holds even if the processors only fail by crashing. The impossibility proof relies heavily on the system being asynchronous. This result was first shown in a breakthrough paper by Fischer, Lynch and Paterson. It is one of the most influential results in distributed computing.

The impossibility holds for both shared memory systems if only read/write registers are used, and for message passing systems. The proof first shows it for shared memory systems. The result for message passing systems can then be obtained through simulation.

Theorem 13.19 There is no consensus algorithm for a read/write asynchronous shared memory system that can tolerate even a single crash failure.

And through simulation the following assertion can be shown.

Theorem 13.20 There is no algorithm for solving the consensus problem in an asynchronous message passing system with n processors, one of which may fail by crashing.

Note that these results do not mean that consensus can never be solved in asynchronous systems. Rather the results mean that there are no algorithms that guarantee termination, agreement, and validity, in all executions. It is reasonable to assume that agreement and validity are essential, that is, if a consensus algorithm terminates, then agreement and validity are guaranteed. In fact there are efficient and useful algorithms for the consensus problem that are not guaranteed to terminate in all executions. In practice this is often sufficient because the special conditions that cause non-termination may be quite rare. Additionally, since in many real systems one can make some timing assumption, it may not be necessary to provide a solution for asynchronous consensus.


Exercises

13.4-1 Prove the correctness of algorithm Consensus-Crash.

13.4-2 Prove the correctness of the consensus algorithm in the presence of Byzantine failures.

13.4-3 Prove Theorem 13.20.