Friday, April 20, 2012

Chain Replication and Consensus

A few weeks ago, I had a chance to chat with one of the grad students who worked on the FAWN project at CMU. FAWN is a pretty cool systems project and their original paper is on my NoSQL reading list. FAWN uses chain-replication to provide high-availability. I wanted to understand the gap between chain-replication and consensus-based approach like in Spinnaker. The group also has a project called Ouroboros that extends chain-replication with non-blocking node addition and removal protocols, but the properties I discuss here apply to basic chain replication as well as Ouroboros.
 
Now, as described in the paper, chain replication needs f + 1 nodes to tolerate f failures. Paxos needs 2f + 1 nodes to tolerate f failures. This is a nice advantage for chain replication, but what is it really giving up? Turns out that the answer is that this technique leaves it vulnerable to certain failure sequences.

If we have f + 1 failures at once -- such as the whole cluster losing power because of a data center outage -- we might end up in an inconsistent state with the FAWN strategy.  Here's an example: Once power is restored, say one of the nodes in a group of three, say A comes up. The other two (B and C) have trouble, and have not yet come up. The node A is designated the head of the chain (by an external master), and starts accepting updates. This is okay since chain replication in FAWN allows you to accept writes with a single node (this is not described in the paper, but in a personal communication). Now A fails. One of the other nodes (say B) now comes up, and gets designated the master B. Recall that B only had a transient failure, and has most of the state except the recent writes that were accepted by A. However, if B serves reads, it will serve stale data and the client may notice that the system has lost some of its updates. If B now accepts updates, things could get very difficult to resolve once A comes back up.

This is rather obvious once you think about it:  you can operate in the "read quorum + write quorum < replica group size" region if you are willing to assume that certain failure sequences will not happen. When such a failure sequence happens, you lose consistency! If we're willing to risk the fact that we will never have a cluster-wide outage, FAWN-style chain replication can provide great performance and non-blocking reconfiguration for up to f failures in a replica group with f+1 members. Apart from being more available for reads and writes in the presence of failures, chain replication is a easier to implement than a a consensus protocol. Are there other relaxations from consensus that would be useful if we knew that certain kinds of failure sequences would be exceedingly rare? What if we could assume that two nodes in a replica group would not fail within k minutes of each other?

On a related note, there's a PODC paper from MSR in 2009 that talks about how you can view chain replication as a special case of Vertical Paxos. But in this paper, the write quorum here has to be the entire replica group -- that would mean the system would block for writes if any node failed, but could serve reads as long as at least one node was up. With this more conservative protocol, of course, chain replication would tolerate the same failure sequences that regular Paxos would.