Spinnaker and “NoSQL"
Spinnaker is an interesting research project I worked on in the “NoSQL” space. We have a paper describing the system in PVLDB. Since VLDB is still a few months away, I figured I’d write up a quick-and-easy summary, and some interesting perspective that we didn’t have room for in the paper.
Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads.
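For concreteness, here is a hypothetical sketch (in Java, since Spinnaker started from the Cassandra codebase) of what a client API along these lines might look like. The names and signatures below are my own illustration, not the actual interface from the paper.

```java
// Hypothetical sketch of a Spinnaker-style client API. The names and signatures
// are illustrative only, not the actual interface described in the paper.
public interface SpinnakerClient {

    // Consistency level chosen per read: strong reads are served by the leader of
    // the key's replication cohort; timeline reads may be served by a follower
    // that can lag slightly, but never returns writes out of order.
    enum ReadConsistency { STRONG, TIMELINE }

    // Writes for a key are routed to the replica group owning that key range.
    void put(String key, byte[] value);

    // Reads take an explicit per-operation consistency choice.
    byte[] get(String key, ReadConsistency consistency);

    // A conditional put is the kind of single-key primitive one would expect from
    // a "transactional get-put API" (again, illustrative only).
    boolean putIfVersionMatches(String key, byte[] value, long expectedVersion);
}
```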
There are three big architectures in the scale-out NoSQL space: Bigtable, Dynamo, and PNUTS. In Spinnaker, we tried to tackle the question of how one would design a key-value store if we weren’t constrained by any existing systems. The goal was to build a scalable key-value store with a reasonable consistency model, high availability, and good performance.
The Bigtable design, of course, leverages GFS, and that made it simple and elegant: Bigtable doesn’t have to deal with replication and consistency because the filesystem takes care of that. But there is a performance and availability cost to pushing those concerns down to the filesystem.
Dynamo was great except for one big problem: eventual consistency. While this is a reasonable choice for super-duper-amazing-high availability, most applications would probably prefer an easier consistency model. As the application gets more complicated than a shopping cart, programming against eventual consistency gets extremely confusing and tricky. This is a burden we don’t want to place on the application developer unless it is unavoidable.
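To make the “confusing and tricky” part concrete, here is a tiny, self-contained toy example (my own illustration, not Spinnaker or Dynamo code) of the anomaly an application can see under eventual consistency:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of an eventual-consistency anomaly. Two maps stand in for two
// replicas of the same key range whose replication is asynchronous.
public class EventualConsistencyAnomaly {
    public static void main(String[] args) {
        Map<String, String> replicaA = new HashMap<>();
        Map<String, String> replicaB = new HashMap<>();
        replicaA.put("cart:alice", "book");
        replicaB.put("cart:alice", "book");

        // The client adds an item; the write lands on replica A first, and the
        // update to replica B is still in flight.
        replicaA.put("cart:alice", "book, headphones");

        // A read routed to replica B observes the stale cart.
        System.out.println(replicaB.get("cart:alice"));  // prints "book"

        // Under eventual consistency, the application has to detect and reconcile
        // this kind of staleness (and conflicting versions) itself; with strong or
        // timeline reads, it doesn't.
    }
}
```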
PNUTS leverages a fault-tolerant pub-sub system that Yahoo had already built. This too probably has an associated scalability and availability cost: the pub-sub system is a central point of communication for all the write traffic, so the whole system can only scale as far as that pub-sub system can scale. I’m guessing that PNUTS used a single-server, hardened-with-hardware approach to making the pub-sub system work, which could be a scalability bottleneck. There have been a few scalable pub-sub efforts since then (Hedwig, Kafka, etc.). Then again, we didn’t have one of these lying around, so we asked: what’s the ideal design if you were to build one of these key-value stores from scratch?
Spinnaker tries to bring the best of the Bigtable, Dynamo, and PNUTS designs together.
Spinnaker doesn’t use a DFS or a central fault-tolerant pub-sub system. Instead, it uses a consensus-based replication algorithm and leverages Zookeeper for coordination. We used the Cassandra codebase as the starting point for the Spinnaker design, and the node architecture looks very much like Bigtable’s (with log-structured maintenance of on-disk data using SSTables). I won’t go into the details of the architecture here; you can read the paper for that.

The interesting finding was that the Spinnaker design can be competitive with an alternative like Cassandra that provides weaker consistency guarantees. Compared to Cassandra, we showed that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes. On node failure, if the node happened to be the “leader” of a replication cohort, Spinnaker suffers a short unavailability window of under 2 seconds for writes to that key range; reads continue to be available. In comparison, HBase suffers substantially longer unavailability windows when a node fails, for both reads and writes. Cassandra is eventually consistent, and therefore always available despite failures.
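To give a flavor of what the write path of a leader-based, consensus-style replication scheme looks like, here is a highly simplified sketch. This is my own illustration under heavy simplifications, not the protocol from the paper, which also has to handle log sequence numbers, leader election via Zookeeper, follower catch-up, and recovery.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Highly simplified sketch of a leader's write path under majority-ack replication.
// Illustrative only; the actual Spinnaker protocol is described in the paper.
public class LeaderWritePath {

    interface Follower {
        // Appends the record to the follower's log and in-memory structures, then
        // runs the callback to acknowledge the leader. Assumed to be asynchronous.
        void replicate(String key, byte[] value, Runnable ack);
    }

    private final List<Follower> followers;  // the other two replicas in the cohort

    public LeaderWritePath(List<Follower> followers) {
        this.followers = followers;
    }

    // A write is acknowledged to the client once it is on a majority of the three
    // replicas: the leader itself plus at least one follower. That majority is what
    // allows a newly elected leader to recover the write after a failure.
    public void put(String key, byte[] value) throws InterruptedException {
        appendToLocalLogAndMemtable(key, value);          // replica #1: the leader

        CountDownLatch majority = new CountDownLatch(1);  // need 1 of the 2 followers
        for (Follower f : followers) {
            f.replicate(key, value, majority::countDown);
        }

        // A slow or failed follower does not block the write, as long as a majority responds.
        if (!majority.await(2, TimeUnit.SECONDS)) {
            throw new IllegalStateException("could not reach a majority of replicas");
        }
        // Safe to acknowledge the client: the write is on 2 of 3 replicas.
    }

    private void appendToLocalLogAndMemtable(String key, byte[] value) {
        // In the real system: append to the commit log and the memtable, which is
        // later flushed to SSTables (the Bigtable-style log-structured layout).
    }
}
```

Whether that commit-log append is forced to disk before acknowledging, or only needs to be in memory on a majority, is exactly the “2 out of 3 memories” durability option discussed next.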
It was really interesting to see that a consensus-based replication algorithm can provide pretty good performance. I do feel that more and more apps that really need a large scale-out key-value store probably don’t need the kind of durability guarantees that Spinnaker can provide. Spinnaker can be configured to provide durability of “2 out of 3 memories”: a write is acknowledged once it is in memory on two of the three replicas, without waiting for it to be forced to disk. This gives a big boost to latency and throughput … but if that’s really the target, I’d look very carefully at a different system. Look for the next post :)