Monday, March 12, 2012

Clydesdale Updates

The Hadoop Summit is coming up in June in San Jose. Every year, this event gets bigger and more interesting.
I've submitted an abstract to talk about Clydesdale and structured data processing on Hadoop in general. You can vote for it here: https://hadoopsummit2012.uservoice.com/forums/151413-track-1-future-of-apache-hadoop/suggestions/2663590-structured-data-processing-on-hadoop

In other news, our demonstration proposal for Clydesdale got accepted at SIGMOD. If you're going to be at the conference in Arizona in May this year, drop by for a chat and for the latest updates on what we have cooking in this space.

Tuesday, March 6, 2012

Re-Sequencing in Bioinformatics

Genome sequencing is still one of the major problems in bioinformatics where one has to deal with large amounts of data. As advances in technology make it progressively cheaper to sequence an entire genome, scientists are coming up with many interesting questions they can tackle with sequencing.

I had assumed that the computational aspects of sequencing (or more specifically, re-sequencing) were a mostly solved problem that didn't really need much attention from computer scientists anymore. (I haven't spent much time or energy in this space since my grad school days.) It turns out there have been some very interesting developments in this space in the last three years.

Here's the basic problem definition: you get a tissue sample from someone and stick it in a machine. The machine spits out many "reads" that are short strings over the alphabet {A,C,G,T} -- each of these reads is the sequence for some portion of the DNA in the sample. These are noisy measurements, so there may be occasional errors. Over a few hours, the machine spits out many millions of reads. Now, the computational task is to build up the best guess for the whole sequence from which these reads arose given a reference sequence that is "pretty close".


The general idea that many algorithms follow is to align these reads to the reference sequence while tolerating a few mismatches. Then, for each position in the reference sequence, look up the consensus of all the aligned reads covering it to figure out the base (A, C, G, or T) at that position. There are many variants of this approach: some use indexes on the reference sequence, some use indexes on the reads, some use heuristic approximate alignments, some use relatively expensive Smith-Waterman-style alignments; sometimes there are multiple candidate reference sequences, and the reads themselves can come with additional constraints (e.g., paired-end reads).
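To make the consensus step concrete, here's a minimal sketch of majority-vote consensus calling over already-aligned reads. It is not taken from any particular aligner or variant caller; the AlignedRead type, the in-memory pileup, and the complete absence of base qualities, indels, and error models are all simplifying assumptions for illustration.

```java
import java.util.List;

// Toy majority-vote consensus caller: for each reference position, count the
// bases of all reads covering it and pick the most frequent one.
class AlignedRead {
    final int refStart;   // 0-based position where the read aligns on the reference
    final String bases;   // read sequence over {A,C,G,T}

    AlignedRead(int refStart, String bases) {
        this.refStart = refStart;
        this.bases = bases;
    }
}

public class ConsensusCaller {
    static final char[] ALPHABET = {'A', 'C', 'G', 'T'};

    static String call(String reference, List<AlignedRead> reads) {
        int[][] counts = new int[reference.length()][4];
        for (AlignedRead r : reads) {
            for (int i = 0; i < r.bases.length(); i++) {
                int pos = r.refStart + i;
                if (pos < 0 || pos >= reference.length()) continue;
                int b = indexOf(r.bases.charAt(i));
                if (b >= 0) counts[pos][b]++;
            }
        }
        StringBuilder consensus = new StringBuilder(reference.length());
        for (int pos = 0; pos < reference.length(); pos++) {
            int best = -1, bestCount = 0;
            for (int b = 0; b < 4; b++) {
                if (counts[pos][b] > bestCount) { bestCount = counts[pos][b]; best = b; }
            }
            // Fall back to the reference base when no read covers this position.
            consensus.append(best >= 0 ? ALPHABET[best] : reference.charAt(pos));
        }
        return consensus.toString();
    }

    static int indexOf(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return -1;   // skip ambiguous or erroneous symbols
        }
    }

    public static void main(String[] args) {
        String reference = "ACGTACGT";
        List<AlignedRead> reads = List.of(
                new AlignedRead(0, "ACGA"),   // contains one mismatch
                new AlignedRead(2, "GTAC"),
                new AlignedRead(3, "TACG"));
        System.out.println(call(reference, reads));
    }
}
```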

Here's a list of recent papers/projects dealing with the basic version of this problem:


SNAP is the newest one out of Berkeley, MSR, and UCSF and promises to be more than 10x faster than the competition. Apart from the ones listed above, there are many proprietary algorithms that come with the sequencing machine you buy. And of course, there's a host of older algorithms that the papers above built on.

There are some interesting data management aspects to this problem beyond just sequencing: different people care to different degrees about the actual alignment method used and how much data is retained. Often, the interesting analyses come after alignment and SNP identification and may only need to go back as far as the aligned sequence. Some scientists may want to go back all the way to the reads once they have completed some downstream analysis and want to verify the significance of their findings. If you're a big hospital sequencing thousands of patients, managing the reads, the aligned sequences, and the results of downstream experiments could easily be a multi-petabyte problem.
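To put rough numbers on that (a hedged back-of-envelope, not figures from any specific study): a human genome is roughly 3 billion bases, so at 30x coverage the raw reads for a single individual come to about 90 billion bases -- on the order of 100 GB uncompressed once per-base quality scores are included. Multiply that by a few thousand patients, keep the aligned sequences and downstream results alongside the reads, and you are comfortably in petabyte territory.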

Thursday, March 1, 2012

Research Jobs at Almaden

Our group at IBM Almaden is looking to hire short-term post-docs (for 2-year stints) as well as permanent research staff. If you're a graduating PhD student who has worked on some aspect of large-scale data management research and is excited about the sorts of problems I describe on this blog, send me an email!

Thursday, February 16, 2012

A NoSQL Reading List

I've been putting together a list of papers for interns, post-docs, and other new folks at Almaden who are getting interested in the NoSQL/NewSQL space. Here's what I have so far. Any suggestions or additions to the list would be welcome -- leave them in the comments.

Basic System Design
BigTable, Dynamo, and PNUTS are the three big systems running real applications and should be required reading. FAWN is an excellent read. The HStore paper talks about traditional OLTP, but it is closely related. Hyder is also in the traditional OLTP space and is probably less relevant to the NoSQL space than HStore, but it is a very cool system and explores a major departure from traditional OLTP system design. The RAMCloud effort from Stanford has made some interesting choices that enable very good availability. Finally, I'll include a shameless plug for my own paper, Spinnaker, which I think makes some interesting points about this space.


Consistency, Indexing, Transactions


Other Industrial/Open-Source Systems and Articles
  • Curt Monash's valiant struggle to distinguish between traditional OLTP and "NoSQL" apps: HVSP 
  • Industrial MySQL Scaleout approaches: Clustrix, Schooner, dbShards
  • Oracle NoSQL Whitepaper 
  • HBase, Cassandra, MongoDB -- codebases/architecture

Tuesday, February 14, 2012

DynamoDB = Cassandra-as-a-Service ?

Amazon recently introduced a new database service, DynamoDB, to add to its existing set of database services: SimpleDB and RDS (the hosted MySQL service). DynamoDB seems to be a higher-scale, lower-function version of SimpleDB: it does away with the limitation that data be partitioned into 10GB SimpleDB domains. It also provides much more predictable performance, and you can provision for your expected read and write rates. In exchange, you give up the automatic indexing that SimpleDB had. The querying ability goes away too -- you fall back to get/put requests addressed by primary key. The conditional put/delete calls are still there as a form of simple concurrency control.

DynamoDB comes pretty close to being a pay-as-you-go version of Cassandra. Not surprising, since at least one of the original developers of Cassandra (Avinash Lakshman) actually worked on the Dynamo project at Amazon before joining Facebook.

The major Cassandra advantage seems to be secondary indexes. I'm not sure how important local secondary indexes are for short-request apps, but they could be useful for predicate scans when you hook up MapReduce to the data in Cassandra. The major feature DynamoDB has that Cassandra is missing is conditional put/delete. This is a rather tricky feature -- Jonathan Ellis (DataStax), Jun Rao, and I tried to add it in the early days of Cassandra (https://issues.apache.org/jira/browse/CASSANDRA-48) and quickly realized that it is a good bit harder than it looks to provide clean semantics for this call on an eventually consistent datastore.
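To make the semantics concrete, here's a hedged sketch of what a conditional put/delete (compare-and-set) looks like against a toy, single-node key-value store. The ConditionalStore interface is hypothetical -- it is not the DynamoDB or Cassandra API -- and it sidesteps exactly the hard part mentioned above: providing these semantics on a replicated, eventually consistent store.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of conditional put/delete semantics on a single node.
public class ConditionalStore {
    private final Map<String, String> data = new HashMap<>();

    // Apply the write only if the current value equals `expected`
    // (expected == null means "only if the key is absent").
    // Returns true if the write was applied.
    public synchronized boolean conditionalPut(String key, String expected, String newValue) {
        String current = data.get(key);
        boolean matches = (expected == null) ? (current == null) : expected.equals(current);
        if (matches) {
            data.put(key, newValue);
            return true;
        }
        return false;
    }

    // Delete only if the current value equals `expected`.
    public synchronized boolean conditionalDelete(String key, String expected) {
        if (expected != null && expected.equals(data.get(key))) {
            data.remove(key);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ConditionalStore store = new ConditionalStore();
        System.out.println(store.conditionalPut("user:1", null, "v1"));  // true: key was absent
        System.out.println(store.conditionalPut("user:1", "v0", "v2"));  // false: expectation fails
        System.out.println(store.conditionalPut("user:1", "v1", "v2"));  // true
    }
}
```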

Jonathan Ellis at DataStax posted a feature comparison of Cassandra and DynamoDB at http://www.datastax.com/dev/blog/amazon-dynamodb. The short summary: if you are running things on the cloud, DynamoDB is indeed a compelling choice. If you are a very large operation, want a multi-datacenter deployment, expect to need more functionality than DynamoDB offers (snapshots, integrated caching), and have an ops team that can manage it, then Cassandra is probably a better bet.

Wednesday, February 8, 2012

Clydesdale Overview

Here's a quick preview of the architecture of Clydesdale and some details on its performance. I'll link to the camera-ready version of the EDBT paper soon. The big-picture idea in Clydesdale is the same as Hive's: a SQL-like interface for processing structured data on Hadoop. The target space where this is likely to be most useful is large investigative marts. (See my older post and this post from Monash for background.)

Clients submit a SQL query to Clydesdale, and the compiler turns it into one or more MapReduce jobs. (There are ways to access Clydesdale using more programmatic APIs, but that's for another post :) ) The fact table is on HDFS, stored using CIF. A master copy of the dimension tables is available in HDFS, and dimension tables are also cached on the local storage of each node. The picture below shows the general architecture -- Clydesdale layers neatly on top of unmodified Hadoop.
Clydesdale is focused on processing queries over star schema data sets. It uses several techniques like columnar storage, block iteration, multi-core-aware operators, and a multi-way map-side join plan. This combination of techniques adds up to a massive advantage over Hive, especially on the star schema benchmark. The figure below shows the execution time for each of the queries in the benchmark using 1) Clydesdale, 2) the repartition join strategy in Hive, and 3) the map-join strategy in Hive. Depending on the query, Clydesdale is 5x to 83x faster than the best Hive plan, averaging about 38x across the benchmark. These are numbers on a 9-node cluster running the star-schema benchmark at scale factor 1000. The general trend holds for larger clusters and larger datasets.
To be fair, Hive is general purpose and not really focused on doing well on star-schema workloads. The comparison is not intended as a criticism of Hive, but as evidence of how much performance potential is available on the Hadoop platform for structured data processing! While Hadoop was originally designed for large parallel programming tasks that were more compute-intensive (e.g., analyzing web pages and building search indexes), the platform is not bad at structured data processing either. The ideas in Clydesdale can be used to tremendously improve the performance of Hive and similar systems.

What does this mean for the larger Hadoop ecosystem? This is very good news for BI (Business Intelligence) vendors on Hadoop like Pentaho, Datameer, Platfora, and others. If you rely on SQL-like processing to deliver business insights, and the data lies in Hadoop, we can now do a whole lot better than today's Hive performance!

Tuesday, January 10, 2012

Clydesdale: SQL on Hadoop

Clydesdale is an experimental system we built at Almaden for processing structured data on Hadoop. Structured data processing is becoming an increasingly important workload on Hadoop clusters. While Hive is great at letting people use SQL to process large amounts of structured data on Hadoop, it is surprisingly inefficient. Researchers have recently argued that there is more than an order of magnitude of inefficiency for relational-like workloads on Hive (see this paper). Clydesdale is a new way of tackling this problem. In a paper that was recently accepted to EDBT 2012, we showed that Clydesdale is 38x faster than Hive on the popular Star-Schema Benchmark.

Clydesdale is aimed at workloads where the data fits a star schema. A substantial fraction of structured data repositories follow either a star schema or a snowflake schema. Clydesdale is specialized to perform well for these, but it can work on arbitrary schemas. We draw on many well-known techniques from the parallel DBMS world, like column-oriented storage, tailored join plans, and multi-core execution strategies, to achieve dramatic performance improvements over existing approaches. The best part is that Clydesdale works on a stock Hadoop installation -- no modifications are needed to the underlying implementation! Using the star schema benchmark, we show that Clydesdale is 5x to 89x faster for different queries, averaging a 38x advantage. I'll post a link to the paper with all the details as soon as I have the camera-ready version finalized for EDBT.

Here's a high-level overview of the architecture and the query processing strategy: The fact table is stored on HDFS using CIF, a columnar storage approach described in our earlier VLDB paper. (I've also previously discussed CIF here and here.) A master copy of the dimension tables is also stored in HDFS; additionally, they are replicated to the local storage of each node in the cluster. SQL queries submitted to Clydesdale turn into MapReduce jobs. Join processing is done in the map phase: mappers apply predicates to the dimension tables and build hash tables in the setup phase; then they join the rows of the fact table by probing the hash tables and emitting the rows that satisfy the join. The reduce phase is responsible for grouping and aggregation. Clydesdale draws on several strategies for performance improvement: careful use of multi-core parallelism, employing a tailored star-join plan instead of joining tables two-by-two, columnar storage and -- to a limited extent -- columnar execution. None of these techniques is singularly responsible for the overall improvement -- each of them contributes to the 38x overall speedup we measured over Hive.
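For readers who want to see the shape of the map-side join in code, here's a heavily simplified sketch written against the stock Hadoop MapReduce API. It joins a fact table with a single dimension table and aggregates a measure in the reduce phase. The file formats, column positions, local cache path, and single-dimension case are illustrative assumptions -- Clydesdale's actual implementation uses CIF columnar storage and a multi-way star-join plan, which this sketch omits.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified map-side join of a fact table with one dimension table,
// grouping by a dimension attribute and summing a fact measure.
public class StarJoin {

    public static class JoinMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // dimension key -> grouping attribute, built once per mapper in setup()
        private final Map<String, String> dimension = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Build the hash table from the dimension table cached on local disk,
            // applying the dimension predicate as we go.
            try (BufferedReader in = new BufferedReader(
                    new FileReader("/local/cache/dim_date.csv"))) {   // hypothetical path
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",");
                    String dimKey = cols[0];
                    String region = cols[1];
                    String year = cols[2];
                    if (region.equals("ASIA")) {          // dimension predicate
                        dimension.put(dimKey, year);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text factRow, Context context)
                throws IOException, InterruptedException {
            String[] cols = factRow.toString().split(",");
            String foreignKey = cols[0];
            double revenue = Double.parseDouble(cols[1]);
            String year = dimension.get(foreignKey);       // probe the hash table
            if (year != null) {                            // fact row satisfies the join
                context.write(new Text(year), new DoubleWritable(revenue));
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text year, Iterable<DoubleWritable> revenues, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable r : revenues) {
                total += r.get();
            }
            context.write(year, new DoubleWritable(total));   // grouping + aggregation
        }
    }
}
```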

Clydesdale's central contribution is a demonstration that a simple architecture leveraging existing techniques from parallel DBMSs can provide substantial benefits for structured data processing on MapReduce. The experimental results demonstrate that for certain structured data processing workloads, it is not necessary to resort to more radical approaches such as discarding HDFS and embedding a relational database (as in HadoopDB), or requiring unusual storage organization and join-processing techniques (as in Llama). Clydesdale layers on top of an unmodified Hadoop installation and provides major performance improvements for structured data processing.