Google has a paper describing Spanner, the successor to BigTable, that will be presented at OSDI in October this year. Spanner is a globally distributed database and the paper is full of very interesting insights into the problems some of Google's best systems engineers have been tackling over the last five years. Here are my takeaways so far:
1. Spanner was designed to tackle some of the shortcomings of BigTable: a) cross-datacenter replication with the right consistency options, and b) support for applications dealing with complex, evolving schemas and full transactions.
2. A critical design difference from BigTable is that Spanner does not use GFS, the distributed filesystem, for replication. Instead, replication is handled by sharding the key space over Paxos cohorts (a minimal code sketch of this idea appears after this list). This is very similar to the replication design used in Spinnaker, which I presented at VLDB 2011. I have described the advantages of this approach over using a distributed filesystem in a previous post. It gives Spanner's engineers much more control over performance, consistency, and availability trade-offs. And Spanner goes well beyond changing just the replication architecture ...
3. Transactions: Spanner supports ACID transactions. Every replica that is the leader of its Paxos cohort also keeps a lock table to implement concurrency control, and there is a transaction manager at every spanserver that can use this lock table. Details of the concurrency control algorithms are in Section 4 of the paper. Since each transaction manager's state is made highly available with Paxos, distributed transactions can use two-phase commit without having to worry about the lost-leader problem (see the 2PC sketch after this list). Gray and Lamport described ideas on combining Paxos with two-phase commit for highly available distributed transactions back in 2004.
4. TrueTime API: The paper describes a novel approach for coordinating clocks across the planet to enable features like: a) externally consistent transactions, b) lock-free read-only transactions, and c) non-blocking reads of the past. The fascinating part here is that to get good bounds on clock uncertainty and guarantee high availability, they engineered the TrueTime algorithms around GPS and atomic clocks hooked up to time servers in each datacenter! (A sketch of the TrueTime interval and commit wait follows below.)
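To make the sharding idea in point 2 concrete, here is a minimal Python sketch of a key space partitioned across Paxos groups. Everything here (the names PaxosGroup and ShardedStore, the split points, the data-center list) is hypothetical scaffolding of my own, not an interface from the paper, and the actual Paxos rounds are stubbed out.

```python
# Hypothetical sketch: sharding a key space over Paxos groups, in the
# spirit of Spanner's replication design. All names are illustrative.
import bisect

class PaxosGroup:
    """One replication cohort; a real implementation would run Paxos
    among its replicas. Here we only model membership and a leader."""
    def __init__(self, group_id, replicas):
        self.group_id = group_id
        self.replicas = replicas          # e.g. ["dc1", "dc2", "dc3"]
        self.leader = replicas[0]         # elected via Paxos in reality

    def replicate_write(self, key, value):
        # A real leader would drive a Paxos round here; we just pretend
        # the write was accepted by a majority of replicas.
        return f"committed {key!r} on group {self.group_id}"

class ShardedStore:
    """Maps contiguous key ranges to Paxos groups, so replication is
    per-shard rather than delegated to a distributed filesystem."""
    def __init__(self, split_points, groups):
        assert len(groups) == len(split_points) + 1
        self.split_points = split_points  # sorted keys, e.g. ["g", "p"]
        self.groups = groups

    def group_for(self, key):
        return self.groups[bisect.bisect_right(self.split_points, key)]

    def put(self, key, value):
        return self.group_for(key).replicate_write(key, value)

store = ShardedStore(
    ["g", "p"],
    [PaxosGroup(i, ["dc1", "dc2", "dc3"]) for i in range(3)],
)
print(store.put("apple", 1))   # lands on group 0
print(store.put("zebra", 2))   # lands on group 2
```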
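Point 3 is easier to see in code. Below is a heavily simplified sketch of two-phase commit where the commit/abort decision is appended to a stand-in Paxos-replicated log before participants are told the outcome; that durable decision is roughly why a lost coordinator stops being fatal. The class names and the lock-table representation are my own inventions, not Spanner's.

```python
# Hypothetical sketch of 2PC with the coordinator's decision made
# durable via a (stubbed) Paxos log. Names are illustrative.

class PaxosLog:
    """Stand-in for a Paxos-replicated log; append() would only return
    once a majority of replicas had accepted the entry."""
    def __init__(self):
        self.entries = []
    def append(self, entry):
        self.entries.append(entry)        # majority-replicated in reality

class Participant:
    def __init__(self, name):
        self.name = name
        self.locks = {}                   # lock table kept at the leader
    def prepare(self, txn_id, keys):
        # Acquire write locks; vote no if any key is already locked.
        if any(k in self.locks for k in keys):
            return False
        for k in keys:
            self.locks[k] = txn_id
        return True
    def finish(self, txn_id, commit):
        # Release this transaction's locks whether it commits or aborts.
        self.locks = {k: t for k, t in self.locks.items() if t != txn_id}

def two_phase_commit(txn_id, work, log):
    """work maps participant -> keys. Phase 1 gathers votes; the
    decision is logged through Paxos before phase 2 notifies anyone."""
    votes = [p.prepare(txn_id, keys) for p, keys in work.items()]
    decision = all(votes)
    log.append((txn_id, "commit" if decision else "abort"))
    for p in work:
        p.finish(txn_id, commit=decision)
    return decision

log = PaxosLog()
a, b = Participant("A"), Participant("B")
print(two_phase_commit("t1", {a: ["x"], b: ["y"]}, log))  # True
```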
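Finally, the TrueTime idea from point 4 can be sketched fairly directly, since the paper describes TT.now() as returning an uncertainty interval. The commit-wait loop below shows why a bounded uncertainty matters: a coordinator waits until its chosen timestamp is definitely in the past before acknowledging a commit. The epsilon value and the use of time.time() are illustrative stand-ins for the GPS/atomic-clock infrastructure.

```python
# Sketch of TrueTime's uncertainty interval and commit wait. The
# epsilon bound and clock source are made up for illustration.
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

class TrueTime:
    def __init__(self, epsilon=0.007):    # ~7 ms bound, illustrative
        self.epsilon = epsilon
    def now(self):
        t = time.time()                   # GPS/atomic-disciplined in reality
        return TTInterval(t - self.epsilon, t + self.epsilon)
    def after(self, t):
        # True once t is guaranteed to have passed on every clock.
        return self.now().earliest > t

def commit_wait(tt, commit_ts):
    """Block until commit_ts is definitely in the past everywhere; this
    waiting is what buys externally consistent transactions."""
    while not tt.after(commit_ts):
        time.sleep(0.001)

tt = TrueTime()
# A coordinator picks a timestamp no earlier than TT.now().latest ...
s = tt.now().latest
commit_wait(tt, s)   # ... then waits out the uncertainty before replying.
print("safe to report commit at timestamp", s)
```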
While Spanner does a lot more than the BigTable/GFS combination, the best thing about those systems was their simplicity -- Google's engineers cleverly chose not to build certain features so that they could scale and still support a large class of applications. With Spanner, they can probably support most database applications, but the complexity of the system is substantially greater. One would have to understand Chubby, GFS, BigTable, *and* database recovery and concurrency control algorithms to appreciate all the complexity that goes into building a system like Spanner.
The Spanner paper was indeed a delightful read!
Although Spanner is significantly more complex (internally) than BigTable/GFS, I completely agree with the authors that it's much more useful to enable application simplicity (e.g., through ACID transactions and a SQL-like interface) at the cost of internal complexity, rather than the other way round.
Indeed, the complexity of the system is reflected in the amount of work that went into its design and implementation. Judging by the number of authors and acknowledged contributors, and the 5 years of work, this sounds like at least 100 man-years!
Cheers,
Cosmin