Sandeep's Research Notes: February 2012

Thursday, February 16, 2012

A NoSQL Reading List

I've been putting together a list of papers for interns, post-docs, and other new folks at Almaden who are getting interested in the NoSQL/NewSQL space. Here's what I have so far. Any suggestions/additions to the list would be welcome -- leave them around as a comment.

Basic System Design
BigTable, Dynamo, and PNUTS are the three big systems running real applications and should be required reading. FAWN is an excellent read. The HStore paper talks about traditional OLTP, but is definitely closely related. Hyder is also in the traditional OLTP space and is probably less relevant to the NoSQL space than HStore, but this is a very cool system and explores a major departure from traditional OLTP system design. The RamCloud effort from Stanford has made some interesting choices that allow very good availability numbers. Finally, I'll include a shameless plug for my own paper, Spinnaker, which I think makes some interesting points about this space.

Consistency, Indexing, Transactions

PNUTS Indexing and Views
Google Megastore
Helland's CIDR 2007 paper, and the one in 2009 with Campbell. I've had so much fun reading these.
Dave Lomet's take on transactions in the cloud
Abadi's musings on CAP and PACELC
Hellerstein's group on quantifying eventual consistency
PiQL , a project from Berkeley, makes some interesting arguments on guaranteeing bounded time querying:

Other Industrial/Open-Source Systems and Articles

Curt Monash's valiant struggle to distinguish between traditional OLTP and "NoSQL" apps: HVSP
Industrial MySQL Scaleout approaches: Clustrix, Schooner , dbShards
Oracle NoSQL Whitepaper
HBase, Cassandra, MongoDB -- codebases/architecture

Tuesday, February 14, 2012

DynamoDB = Cassandra-as-a-Service ?

Amazon recently introduced a new database service: DynamoDB to add to its existing set of database services: SimpleDB and RDS (the hosted MySQL service). DynamoDB seems to be a higher-scale, lower function version of SimpleDB: it does away with the limitation that data be partitioned into 10GB sized SimpleDB domains. It also provides much more predictable performance and you can provision for your expected read and write rates. In exchange, you give up the automatic indexing ability that SimpleDB had. The querying ability goes away too -- you fall back to get/put requests addressed by primary keys. The conditional put/delete calls are still there as a form of simple concurrency control.

DynamoDB comes pretty close to being a pay-as-you-go version of Cassandra. Not surprising, since at least one of the original developers of Cassandra (Avinash Lakshman) actually worked on the Dynamo project at Amazon before joining Facebook.

The major Cassandra advantage seems to be secondary indexes. I'm not sure how important local secondary indexes are for short-request apps, but it could be useful to do predicate scans when you hook up MapReduce to the data in Cassandra. The major feature that DynamoDB has that Cassandra is missing is conditional put/delete calls. This is a rather tricky feature -- Jonathan Ellis (Datastax), Jun Rao, and I tried to add this in the early days of Cassandra (https://issues.apache.org/jira/browse/CASSANDRA-48) and quickly realized it was a good bit harder to provide clean semantics for this call on an eventually consistent datastore.

Jonathan Ellis at DataStax posted a feature-comparison of Cassandra and DynamoDB at http://www.datastax.com/dev/blog/amazon-dynamodb. The short summary is: if you are running things on the cloud, then DynamoDB is indeed a compelling choice. If you are a very large operation, and want a multi-datacenter deployment, expect to need more functionality than DynamoDB (snapshots, integrated caching), and have an ops team that can manage it, then Cassandra is probably a better bet.

Wednesday, February 8, 2012

Clydesdale Overview

Here's a quick preview of the architecture of Clydesdale and details on its performance. I'll link to the camera-ready version of the EDBT paper soon. The big-picture idea in Clydesdale is the same as Hive: a SQL-like interface for processing structured data on Hadoop. The target space where this is likely to be most useful is large investigative marts. (See my older post and this post from Monash for backgroud.)

The clients submit a SQL query to Clydesdale, the compiler turns it into one or more MapReduce jobs. (There are ways to access Clydesdale using more programmatic APIs, but that's for another post :) )The fact table is on HDFS, stored using CIF. A master copy of the dimension tables is available in HDFS. Dimension tables are also cached on the local storage of each node. The picture below shows the general architecture -- Clydesdale layers neatly on top of unmodified Hadoop.

Clydesdale is focused on processing queries over star schema data sets. It uses several techniques like columnar storage, block iteration, multi-core aware operators, and a multi-way map-side join plan. This combination of techniques adds up to a massive advantage over Hive, especially for the star schema benchmark. The figure below shows the execution time for each of the queries in the benchmark using 1) Clydesdale, 2) the repartition join strategy in Hive, and 3) the map-join strategy in Hive. Depending on the query, Clydesdale is 5x to 83x faster than the best Hive plan, averaging about 38x across the benchmark. These are numbers on a 9-node cluster running the star-schema benchmark at scale factor 1000. The general trend holds for larger clusters and larger datasets.

To be fair, Hive is general purpose and not really focused on doing well on star-schema workloads. The comparison is not intended as a criticism of Hive, but as evidence for how much performance potential is available on the Hadoop platform for structured data processing! While Hadoop was originally designed for large parallel programming tasks that were more compute-intensive (eg. analysing web pages and building search indexes), the platform is not so bad when it comes to structured data processing. The ideas in Clydesdale can be used to tremendously improve the performance of Hive and similar systems.

What does this mean for the larger Hadoop ecosystem? This is very good news for BI (Business Intelligence) vendors on Hadoop like Pentaho, Datameer, Platfora, and others. If you rely on SQL-like processing to deliver business insights, and the data lies in Hadoop, we can now do a whole lot better than today's Hive performance!