Tuesday, January 10, 2012

Clydesdale: SQL on Hadoop

Clydesdale is an experimental system we built at Almaden for processing structured data on Hadoop. Structured data processing is becoming an increasingly important workload on Hadoop clusters. While Hive is great at letting people use SQL to process large amounts of structured data on Hadoop, it is surprisingly inefficient. Researchers have recently argued that Hive suffers from more than an order of magnitude of inefficiency on relational-like workloads (see this paper). Clydesdale is a new way of tackling this problem. In a paper recently accepted to EDBT 2012, we showed that Clydesdale is, on average, 38x faster than Hive on the popular Star Schema Benchmark.

Clydesdale is aimed at workloads where the data fits a star schema. A substantial fraction of structured data repositories follow either a star schema or a snowflake schema. Clydesdale is specialized to perform well on these, but it can also handle arbitrary schemas. We draw on many well-known techniques from the parallel DBMS world, like column-oriented storage, tailored join plans, and multi-core execution strategies, to achieve dramatic performance improvements over existing approaches. The best part is that Clydesdale works on a stock Hadoop installation -- no modifications to the underlying implementation are needed! Using the Star Schema Benchmark, we show that Clydesdale is 5x to 89x faster depending on the query, averaging a 38x advantage. I'll post a link to the paper with all the details as soon as I have the camera-ready version finalized for EDBT.
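If you haven't seen the benchmark, a quick illustration may help: a star-schema query joins one large fact table to a handful of small dimension tables, filters on dimension attributes, and aggregates a measure from the fact table. The snippet below is just a hand-written, SSB-flavored example held in a plain Java constant -- the table and column names follow common Star Schema Benchmark conventions, and the wrapper class is hypothetical (Clydesdale's actual query submission interface isn't shown in this post).

```java
// Illustrative only: an SSB-style star-join query of the kind Clydesdale
// targets. The class is a stand-in, not part of Clydesdale's API.
public class ExampleStarJoinQuery {
  // Join the large lineorder fact table to two small dimension tables,
  // filter on a dimension attribute, then group and aggregate a fact measure.
  public static final String SSB_STYLE_QUERY =
      "SELECT d_year, p_brand1, SUM(lo_revenue) AS revenue " +
      "FROM lineorder " +
      "JOIN dwdate ON lo_orderdate = d_datekey " +
      "JOIN part   ON lo_partkey   = p_partkey " +
      "WHERE p_category = 'MFGR#12' " +
      "GROUP BY d_year, p_brand1 " +
      "ORDER BY d_year, p_brand1";
}
```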

Here's a high-level overview of the architecture and the query processing strategy: The fact table is stored on HDFS using CIF, a columnar storage approach described in our earlier VLDB paper. (I've also previously discussed CIF here and here.) A master copy of the dimension tables is also stored in HDFS, and they are additionally replicated to local storage on each node in the cluster. SQL queries submitted to Clydesdale turn into MapReduce jobs. Join processing happens in the map phase: in the setup phase, mappers apply predicates to the dimension tables and build hash tables; they then join the fact table rows against these hash tables, probing each one and emitting the rows that satisfy the join. The reduce phase handles grouping and aggregation. Clydesdale draws on several strategies for performance: careful use of multi-core parallelism, a tailored star-join plan instead of joining tables two at a time, columnar storage and -- to a limited extent -- columnar execution. No single technique is responsible for the overall improvement -- each contributes to the 38x overall speedup we measured over Hive.
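To make the flow concrete, here is a minimal sketch of that map-side star-join pattern written against the vanilla Hadoop MapReduce API. It assumes row-delimited text input and SSB-style column positions purely for readability; the real Clydesdale reads columnar CIF data and uses its own operators, so the class names, file paths, and column indices below are illustrative, not Clydesdale's actual code.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StarJoinSketch {

  // Map phase: build in-memory hash tables over the locally replicated
  // dimension tables in setup(), then stream fact rows, probe the hash
  // tables, and emit only the rows that satisfy the star join.
  public static class StarJoinMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Map<String, String> dateDim = new HashMap<>(); // d_datekey -> d_year
    private final Map<String, String> partDim = new HashMap<>(); // p_partkey -> p_brand1

    @Override
    protected void setup(Context context) throws IOException {
      // Load dimension tables from local storage into hash tables.
      // (Applying dimension predicates here, to keep the tables small,
      // is omitted for brevity.)
      loadDimension("/local/dim/dwdate.tbl", dateDim);
      loadDimension("/local/dim/part.tbl", partDim);
    }

    @Override
    protected void map(LongWritable key, Text factRow, Context context)
        throws IOException, InterruptedException {
      String[] f = factRow.toString().split("\\|");
      String year  = dateDim.get(f[5]);  // probe on lo_orderdate
      String brand = partDim.get(f[3]);  // probe on lo_partkey
      if (year == null || brand == null) {
        return;                          // fact row fails the star join
      }
      long revenue = Long.parseLong(f[12]);
      context.write(new Text(year + "|" + brand), new LongWritable(revenue));
    }

    private void loadDimension(String path, Map<String, String> table)
        throws IOException {
      try (BufferedReader r = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = r.readLine()) != null) {
          String[] d = line.split("\\|");
          table.put(d[0], d[1]);         // join key column -> payload column
        }
      }
    }
  }

  // Reduce phase: grouping and aggregation, here SUM(lo_revenue) per group.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }
}
```

Because every mapper carries its own copy of the (filtered) dimension hash tables, the fact table only needs to be scanned once and no shuffle is required for the join itself; the shuffle is reserved for the much smaller grouped aggregates.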

Clydesdale's central contribution is demonstrating that a simple architecture leveraging existing techniques from parallel DBMSs can provide substantial benefits for structured data processing on MapReduce. The experimental results show that, for certain structured data processing workloads, it is not necessary to resort to more radical approaches like discarding HDFS and embedding a relational database (as in HadoopDB), or requiring unusual storage organization and join processing techniques (as in Llama). Clydesdale layers on top of an unmodified Hadoop installation and provides major performance improvements for structured data processing.