An important design goal in scale-out structured storage systems (“NoSQL” systems) is availability. Availability is usually defined as 1 – (MTTR/(MTTF + MTTR)). MTTR, the Mean-Time-To-Repair, is the duration for which a data item is not available for reads or writes when a node fails. MTTF is the Mean-Time-To-Failure.
We did a simple experiment measuring MTTR for both Spinnaker and HBase and learned some interesting stuff, but we didn’t have room for it in the paper.
To compare HBase and Spinnaker, we used the following setup: a single client picks a random node in the cluster and starts writing data. Consecutive keys are used and a single random value of 1KB is written. After a predetermined amount of data is written, we trigger a failure of the node that was processing the writes for this key-range. We do this by killing the leader of the cohort in Spinnaker. For HBase, we killed the regionserver process for the key-range.
We measured the time taken between the first failed write and the first successful write after that as the unavailability window experienced by the client. In both cases, we split this unavailability window into two parts: the time taken to detect a failure and the time taken to actually recover. Both systems exploit Zookeeper for failure detection. The average time to detect a failure can be configured by adjusting the Zookeeper session timeouts and is set to 2 seconds. In the experiment below, we subtract the failure detection times and only report the actual recovery times.
The picture above shows that for HBase, the recovery time grows linearly with the amount of data that was written to the system before a failure occurred. For Spinnaker, the recovery time remains constant. Neat! This is not unexpected: when a node fails, the regions served by the failed regionserver become unavailable for both reads and writes. HBase needs to carry out multiple steps: First, the master node reads the log file (available in HDFS) of the failed regionserver and splits it into separate files for each region. Each of these regions is assigned to a healthy regionserver. These regionservers now replay the log file, construct the appropriate memtable, and make that region available for reads and writes. The total recovery time includes the time taken to first split the log, and then replay it before opening up for updates. This is clearly proportional to the size of the log. This number can be decreased by aggressively flushing the in-memory data structures to on-disk SSTables more frequently (decreasing the memtable flush limit) and checkpointing. While this will reduce the unavailability window, it severely affects write performance. Typical configurations suggest using around half the available memory for the memtables. With 16 gigabytes of memory on each server, the unavailability window on failure is over 13 minutes. As the available memory on a server increases, this window gets longer.
In the case of Spinnaker, the ranges on a failed master become
unavailable for writes until a new master is elected. The replication protocol in Spinnaker ensures that the new master only needs to re-propose the values since the last-known committed value. This is often less than 1 second’s worth of data, and is independent of the size of the log. As the picture shows, the unavailability window of Spinnaker is relatively constant irrespective of the amount of data written.
Spinnaker continues to be available to service weak reads from the other members in the cohort. Strongly consistent reads are serviced by the master, and therefore are unavailable until a new master is elected, just like in the case of writes. In contrast, HBase is unavailable for both reads and writes. Although, I suspect it wouldn’t be too difficult to modify HBase to return a stale read while the recovery is running.
This interesting property is a consequence of the fact that the HBase design does not provide for a “warm standby” and delegates replication and consistency entirely to the filesystem (HDFS). While I’m not aware of any published work, I suspect there are interesting ways in which HDFS and the HBase design can be modified reduce the unavailability window when a node fails.
Eventually consistent systems like Cassandra continue to be available for both reads and writes after failures by sacrificing consistency. By definition, their MTTR is 0 as long as any node in the system is alive, and therefore the availability is 100%. We did not include Cassandra in this experiment.