Thursday, January 22, 2015

The Tail at Scale

Another blog post recommending a paper to read ....

One of the challenges of building large systems on shared infrastructure like AWS and the Google Cloud is that you may have to think harder about dealing with variations in response time than you might if you were designing for dedicated hardware (like a traditional data warehouse).  This is obviously critical when you're building latency-critical services that power interactive experiences (websites, mobile apps, video/audio streaming services, etc.). Depending on the operating point, you're likely to see this effect even with batch processing systems like Map-Reduce and large scale machine learning systems.

A CACM article from Jeff Dean and Luiz Andre Barosso -- The Tail at Scale contains many valuable lessons on how to engineer systems to be robust to tail latencies. Some of the interesting ideas from the paper are to use:

  1. Hedged requests: The client sends a request to one replica, and after a short delay, sends a second request to another replica. As soon as one response is received, the other request is cancelled.
  2. Tied requests: The client sends a request to two replicas (also with a short delay) making sure the requests have metadata so that they know about each other. As soon as one replica starts processing a request, it sends a cancellation message to the other replica (this keeps the client out of the loop for the cancel-logic). Proper implementations of hedged requests and tied requests can significantly improve the 99th percentile latency with a negligible (2% - 3%) increase in the number of requests sent.
  3. Oversharding/Micro-Partitions: The portion of data to be served is divided into many more shards than there are machines to serve them. Each server is assigned several shards to manage. This way, load balancing can be managed in a fairly fine-grained manner by moving a small number of shards at a time from one server to another. This allows us better control in managing the variability in load across machines. Systems like Bigtable (and open-source implementations like HBase), for instance, have servers that each manage dozens of shards.
There are lots of other ideas and interesting discussions in the paper. I strongly recommend it!