
Friday, April 20, 2012

Chain Replication and Consensus

A few weeks ago, I had a chance to chat with one of the grad students who worked on the FAWN project at CMU. FAWN is a pretty cool systems project, and their original paper is on my NoSQL reading list. FAWN uses chain replication to provide high availability. I wanted to understand the gap between chain replication and a consensus-based approach like the one in Spinnaker. The group also has a project called Ouroboros that extends chain replication with non-blocking node addition and removal protocols, but the properties I discuss here apply to basic chain replication as well as Ouroboros.
 
Now, as described in the paper, chain replication needs f + 1 nodes to tolerate f failures. Paxos needs 2f + 1 nodes to tolerate f failures. This is a nice advantage for chain replication, but what is it really giving up? It turns out that the answer is that this technique is vulnerable to certain failure sequences.
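To make the arithmetic concrete, here is a small sketch (mine, in Python, not from either paper) of the group sizes each approach needs for a given f:

# Sketch: replica-group sizes needed to tolerate f simultaneous node failures.

def chain_replication_group_size(f):
    # Chain replication: every node in the chain holds every committed write,
    # so one survivor is enough to keep serving -- f + 1 nodes suffice.
    return f + 1

def paxos_group_size(f):
    # Paxos: progress needs a majority, and any two majorities must intersect,
    # so tolerating f failures requires 2f + 1 nodes.
    return 2 * f + 1

for f in (1, 2, 3):
    print(f, chain_replication_group_size(f), paxos_group_size(f))
# f=1: 2 vs. 3 nodes; f=2: 3 vs. 5; f=3: 4 vs. 7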

If we have f + 1 failures at once -- such as the whole cluster losing power because of a data center outage -- we might end up in an inconsistent state with the FAWN strategy. Here's an example: once power is restored, one of the nodes in a group of three, say A, comes up. The other two (B and C) have trouble and have not yet come up. A is designated the head of the chain (by an external master) and starts accepting updates. This is okay, since chain replication in FAWN allows writes to be accepted by a single node (this is not described in the paper, but came up in a personal communication). Now A fails. One of the other nodes, say B, comes up and gets designated the new head. Recall that B only had a transient failure and has most of the state, except the recent writes that were accepted by A. However, if B serves reads, it will serve stale data, and a client may notice that the system has lost some of its updates. If B now accepts updates, things could get very difficult to resolve once A comes back up.
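Here is a toy replay of that sequence (my own sketch in Python, not FAWN code) that shows the acknowledged write getting lost:

# Each replica is modeled as a dict of key -> value plus an up/down flag.
replicas = {name: {"data": {}, "up": True} for name in "ABC"}

def write(head, key, value):
    # Degraded chain: the head alone accepts and acknowledges the write.
    replicas[head]["data"][key] = value

def read(head, key):
    return replicas[head]["data"].get(key)

# Normal operation: a write propagates down the whole chain A -> B -> C.
for r in "ABC":
    replicas[r]["data"]["x"] = 1

# Cluster-wide power loss; only A comes back and is made head.
replicas["B"]["up"] = replicas["C"]["up"] = False
write("A", "x", 2)                 # acknowledged by A alone

# A fails; B recovers and is designated the new head.
replicas["A"]["up"] = False
replicas["B"]["up"] = True

print(read("B", "x"))              # prints 1 -- the acknowledged write x=2 is gone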

This is rather obvious once you think about it: you can operate in the "read quorum + write quorum < replica group size" region if you are willing to assume that certain failure sequences will not happen. When such a failure sequence does happen, you lose consistency! If we're willing to bet that we will never have a cluster-wide outage, FAWN-style chain replication can provide great performance and non-blocking reconfiguration for up to f failures in a replica group with f + 1 members. Apart from being more available for reads and writes in the presence of failures, chain replication is easier to implement than a consensus protocol. Are there other relaxations of consensus that would be useful if we knew that certain kinds of failure sequences would be exceedingly rare? What if we could assume that two nodes in a replica group would not fail within k minutes of each other?
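As a quick illustration of that region, here is a small check (my own sketch, hypothetical numbers) of whether a read/write quorum choice guarantees that quorums overlap:

# Two quorums are guaranteed to intersect only if R + W > N;
# with R + W <= N they can be disjoint, which is how stale reads sneak in.

def quorums_always_intersect(read_quorum, write_quorum, group_size):
    return read_quorum + write_quorum > group_size

# FAWN-style degraded chain: 3 replicas, a single node serving reads and writes.
print(quorums_always_intersect(1, 1, 3))   # False -- consistency is at risk

# Majority quorums, Paxos-style: 3 replicas, R = W = 2.
print(quorums_always_intersect(2, 2, 3))   # True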

On a related note, there's a PODC paper from MSR in 2009 that shows how chain replication can be viewed as a special case of Vertical Paxos. But in that formulation, the write quorum has to be the entire replica group -- the system would block for writes if any node failed, but could serve reads as long as at least one node was up. With this more conservative protocol, of course, chain replication would tolerate the same failure sequences that regular Paxos would.
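In quorum terms, that conservative configuration is a write quorum of the whole group and a read quorum of one node, which does satisfy the intersection condition; a tiny sketch (mine, not from the paper):

def intersects(read_quorum, write_quorum, group_size):
    return read_quorum + write_quorum > group_size

N = 3
print(intersects(1, N, N))   # True: 1 + 3 > 3, so a read always sees every committed write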

Friday, July 15, 2011

Spinnaker, Paxos, and Gaios

Here's a great paper from Bolosky and others at Microsoft that demonstrates that Paxos can indeed be used as a high-volume replication protocol. They argue that a Paxos-based storage service can provide performance close to what the underlying hardware can support. You don't have to resort to simple primary-backup schemes, which make it difficult to deal with arbitrary machine restarts. You also don't have to give up sequential consistency for performance and deal with the complications of eventual consistency. The crux of their argument is: for a system that runs in a single datacenter on commodity networking and disks, the Paxos implementation will certainly not be the bottleneck.
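The write path they argue is cheap looks roughly like this (a hand-wavy sketch of a leader-based consensus commit, with my own names, not Gaios's actual code):

class Leader:
    def __init__(self, followers):
        self.followers = followers        # objects exposing append(entry) -> bool
        self.log = []

    def replicate(self, entry):
        self.log.append(entry)            # durable local append (fsync elided)
        group_size = len(self.followers) + 1
        acks = 1                          # the leader's own copy counts
        for follower in self.followers:   # a real system sends these in parallel
            if follower.append(entry):    # and batches many entries per round trip
                acks += 1
        return acks > group_size // 2     # committed once a majority is durable

class Follower:
    def __init__(self):
        self.log = []
    def append(self, entry):
        self.log.append(entry)
        return True

leader = Leader([Follower(), Follower()])
print(leader.replicate("put k1 v1"))      # True: 3 of 3 copies durable, majority reached

The extra round trip in this path is the overhead the paper argues is not the bottleneck on commodity networks and disks.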

They implemented this in the context of a storage system called Gaios. The paper has plenty of implementation details and performance results. They even ran an OLTP benchmark on SQL Server configured to use Gaios storage. Neat stuff!

Spinnaker exploits the same ideas as Gaios, but exposes a user-programmable key-value store API instead of building scale-out storage. The results from Gaios independently verify the arguments we tried to make in the Spinnaker paper -- you can use a consensus algorithm for data replication in a scale-out system without sacrificing performance.