Sandeep's Research Notes

Tuesday, September 25, 2012

Google Spanner

Google has a paper describing Spanner, the successor to BigTable, that will be presented at OSDI in October this year. Spanner is a globally distributed database and the paper is full of very interesting insights into the problems some of Google's best systems engineers have been tackling over the last five years. Here are my takeaways so far:

1. Spanner was designed to tackle some of the shortcomings of BigTable: a) cross-datacenter replication with the right consistency options, and b) support for application dealing with complex evolving schemas and full-transactions.

2. A critical design difference from BigTable is that Spanner does not use GFS, the distributed filesystem, for replication. Instead, replication is handled by sharding the key space over Paxos cohorts. This is very similar to the replication design used in Spinnaker (I presented this at VLDB 2011). I have described the advantages of this approach over using a distributed filesystem in a previous post. This clearly allows the engineers of Spanner much more control over performance, consistency, and availability trade-offs. Spanner goes well beyond changing just the replication architecture ...

3. Transactions: Spanner supports ACID transactions. Every replica that is a leader in the Paxos cohort also keeps a lock table to implement concurrency control. There's a transaction manager at every spanserver that can use this lock table. Details of the concurrency control algorithms are in Section 4 in the paper. Since each transaction manager's state is made highly available with Paxos, distributed transactions can use 2-Phase commit without having to worry about the lost leader problem. Gray and Lamport described some ideas on combining Paxos with 2-Phase commit for highly available distributed transactions in 2004.

4. TrueTime API: The paper talks about a novel approach for coordinating clocks across the planet to enable features like: a) externally consistent transactions, b) lock-free read-only transactions, and c) non-blocking reads of the past. The fascinating part here is that to get good bounds on clock uncertainty and guarantee high availability, they engineered the TrueTime algorithms around GPS clocks and atomic clocks hooked up to time-servers in each datacenter!

While Spanner does a lot more than the BigTable/GFS combination, the best thing about those systems was the simplicity -- Google engineers cleverly choose not to build certain features so they could scale and at the same time support a large class of applications. With Spanner, they can probably support most database applications, but the complexity of the system is substantially greater. One would have to understand Chubby, GFS, BigTable, *and* database recovery and concurrency control algorithms to appreciate all the complexity that goes into building a system like Spanner.

Wednesday, August 22, 2012

Amazon Glacier

Amazon recently introduced Glacier, a cloud data archival service. Based on James Hamilton's post, is presumably built on a large, shared, tape infrastructure. With the addition of this service, Amazon now provides a very compelling set of storage services to suit a whole variety of application needs.

Service	Purpose	Pricing (As of August 2012)
ElastiCache	Memcached service	9 cents/hour for 1.7GB node ( = 3800 cents/GB-month, approx.)
DynamoDB	Low latency, guaranteed throughput, indexed, SSD-based NoSQL store	100 cents/GB-month
SimpleDB	Older, more complex NoSQL store	25 cents/GB-month
EC2 Local Storage	Instance-local storage	Approx 36 cents/GB-month
EBS	Reliable block store	10 cents/GB-month
S3	Cross-datacenter durable storage	12.5 cents/GB-month or, for reduced redundancy, 9.3 cents/GB-month
Glacier	Archival storage	1 cent/GB-month

Each service has a different pricing structure based on the I/O requests you subject it to, and different guarantees about how long the requests might take. But if you look at it purely from a storage cost standpoint, you have a clear idea of how much it costs a developer to keep a piece of data in memory, on SSDs, on local on spinning disks, in-datacenter-redundant disks, cross-datacenter-redundant disks, or on tape!

Wednesday, August 8, 2012

Scala, DSLs, and Big Data

I've been playing with Scala for a bunch of different projects the last several months. One of the most interesting things Scala offers for large-scale data management is the possibility of easy-to-build data processing DSLs (Domain Specific Languages).

Why do we suddenly need DSLs? SQL has served us well for the last several decades, and perhaps will continue to do so for most conventional business query processing needs. However, several new classes of data processing problems seem to require programmatic data access, not just query processing. SQL is a great language when your data processing application looks like it needs to perform joins, selections, projections, and aggregation. A vast majority of data processing falls under this category. When the task at hand is substantially different, such as assembling and analyzing DNA sequences, understanding sentiment about various topics from social media feeds, recommending content based on click streams and ratings, or building up ETL workflows (a task that SQL systems have traditionally been bad at), or statistical/predictive analysis - SQL is neither convenient nor a particularly effective interface for expressing the logic involved. Sure, you can solve the problem using SQL + UDFs + scripting. But there are more convenient ways to do it. Besides, SQL optimizers work hard to figure out a good join order and join plan, de-correlate nested subqueries when required, push down predicates. If you application doesn't really need these sorts of optimization, the inconvenience of expressing your logic in SQL doesn't buy you anything!

The promise of embedded DSLs in Scala is that you can devise a sub-language that is convenient for solving problems in the domain of interest. At the same time, you might also be able to specify certain domain specific optimizations that might help execute the DSL program more efficiently. However, being an embedded DSL means that snippets of code in this DSL are also valid Scala code. So you still get the benefits of the Scala compiler and the optimizations that the JIT compiler can provide (Scala runs on the JVM). On the easy-to-build front, this also means you don't have to build separate tooling (editors, compilers, debuggers) for the DSL -- you can simply continue to use the tooling around Scala. DSLs provide an interesting new tradeoff between building a library in a host language vs. building an entirely new language.

People have done lots of interesting things in this space:

Scalding (A Scala DSL built on Cascading, popular at Twitter)
ScalOps (From Tyson Condie and others, formerly at Yahoo Research)
Spark (from the AMP Lab at Berkeley)
Delite and OptiML (from Stanford)

The SIGMOD demo I did this year for Clydesdale used a simple SQL-like DSL to express the star schema benchmark. I did this mostly to convince myself (and others) that it was simpler/easier to use this DSL mechanism than build a SQL parser and compiler from scratch. It certainly was. Also it let us use java libraries in our data processing without the inconveniences that UDFs (user defined functions) cause in many databases.

Another project where I found this very useful was as a DSL for analyzing DNA sequences and looking for statistically significant SNPs in large data sets. I haven't yet gotten around to writing that work up in detail, but it was a fun project that showed how a DSL that compiles down to the Spark runtime can handily beat an approach like combining Hadoop and R (such as the one in RHIPE or Revolution R's R/Hadoop) and having the developer write code that spans the two systems in awkward ways.

I expect we'll see more interesting DSLs that compile down to runtimes on Hadoop, MPI, Spark, Pregel/Giraffe and other frameworks in the next few years.

Thursday, June 28, 2012

BigInsights at Vestas

Here's a video describing how IBM BigInsights (the product that is the home for CIF) helps Vestas in analyzing weather data to understand wind energy. A great high-level description of the project is here. CIF, in some small way, is helping the world move towards renewable energy. Pretty neat!

Wednesday, June 27, 2012

CIF: New Features in our Hadoop Column Store

I've been spending some time adding several features to the CIF column store to support requests that BigInsights customers have been making (more on this soon). Here are some of the more interesting ones:

Incremental Load Utilities
One of the first requirements was to be able to support efficient incremental loading of data. If you recall how CIF works, we have a directory per partition, and a file per column under each directory. With incremental load -- there are two options: 1) create a new sub-directory in each partition and add the column files in there or 2) append to the existing files in the partition. Append support in newer versions of HDFS makes option 2) possible. While option 1) is easy, it does lead to the problem of creating many small files over several batches of loading -- this can lead to many performance problems down the road. Option 2) does not create new files, but has its own problems: if a load utility, running as a MapReduce job fails, this may leave the column store in an invalid state with partially appended data in several files in different partitions. Solving this ended up being a surprisingly fun problem -- I'll describe this in detail in another post.

Partitioning
This is a relatively simple feature as far as the storage layer is concerned. Hive also supports this to a certain extent. The idea is to be able to partition the data on some key -- such as time, geography, etc -- and store the records corresponding to each partition separately. The partitions don't necessarily have to be semantic, you could partition by the hash of some key column if all you cared about was load balancing. Semantic partitioning allows us to eliminate some partitions from being scanned based on query predicates. The Jaql-CIF integration layer takes predicates from Jaql queries and exploits those to eliminate partitions from the column store that are guaranteed not to contain any relevant data for the query. The same thing can be done if you're using CIF with Hive or Pig or Cascading or any other layer that produces MapReduce jobs.
Of course, maintaining the partitions over a long time frame as new data gets loaded can get tricky -- so we needed to build utilities to split and merge partitions.

Concurrent Readers and Writers
The mechanisms that we introduced to make incremental load safe included a simple, lightweight form of versioning. As a result, we were able to support concurrent readers and writers (load jobs) on the same dataset with a basic form of isolation. Definitely useful in a busy multi-user setting.

Fine-grained Splits

The initial version of the column store only allowed for an entire partition to be turned into a split. We added an additional index structure that tracks the offests in each column file for the start of every k-th row in the partition. Managing this adds a little bit of overhead, but allows us to produce more splits (to utilize more map slots) if necessary. This turns out to be extremely useful for tasks that are CPU intensive. If it so happens that the data in the partition is sorted by some key, then this index structure can be used as a true coarse-grained index on this key and we can use this to push down split elimination to a lower level than just partition-sized splits.

Wednesday, June 20, 2012

Academic Research and Real-World Impact

Joey Hellerstein (one of my favorite researchers in CS) responds to Matt Welsh's criticism of the academic research model.

Tuesday, June 5, 2012

Clydesdale Performance Graph

Clydesdale: SIGMOD Demo

An interactive chart comparing Clydesdale and Hive on the Star-Schema Benchmark that I built for a SIGMOD demo. Scale Factor = 1000, Cluster = 9 worker nodes + 1 master, 8 local disks per node