
Monday, March 12, 2012

Clydesdale Updates

The Hadoop Summit is coming up in June in San Jose. Every year, this event gets bigger and more interesting.
I've submitted an abstract to talk about Clydesdale and structured data processing on Hadoop in general. You can vote for it here: https://hadoopsummit2012.uservoice.com/forums/151413-track-1-future-of-apache-hadoop/suggestions/2663590-structured-data-processing-on-hadoop

In other news, our demonstration proposal for Clydesdale got accepted at SIGMOD. If you're going to be at the conference in Arizona in May this year, drop by for a chat and for the latest updates on what we have cooking in this space.

Wednesday, February 8, 2012

Clydesdale Overview

Here's a quick overview of Clydesdale's architecture and some details on its performance. I'll link to the camera-ready version of the EDBT paper soon. The big-picture idea in Clydesdale is the same as in Hive: a SQL-like interface for processing structured data on Hadoop. The space where this is likely to be most useful is large investigative marts. (See my older post and this post from Monash for background.)

Clients submit a SQL query to Clydesdale, and the compiler turns it into one or more MapReduce jobs. (There are ways to access Clydesdale using more programmatic APIs, but that's for another post :) ) The fact table is on HDFS, stored using CIF. A master copy of the dimension tables is available in HDFS; the dimension tables are also cached on the local storage of each node. The picture below shows the general architecture -- Clydesdale layers neatly on top of unmodified Hadoop.
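To make the dimension-table caching part concrete, here's a toy sketch (in Java, against the standard Hadoop FileSystem API) of how a worker node might refresh its local copy of a dimension table from the master copy on HDFS. The paths and the staleness check are invented for illustration; this is not Clydesdale's actual code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Toy sketch: each worker keeps a node-local copy of a dimension table,
    // refreshed from the master copy on HDFS when it looks stale.
    public class DimensionCache {
      public static Path ensureLocalCopy(Configuration conf, String table) throws Exception {
        Path master = new Path("/warehouse/dimensions/" + table);   // master copy on HDFS (made-up path)
        Path local  = new Path("/local/clydesdale/cache/" + table); // node-local cache (made-up path)

        FileSystem hdfs = FileSystem.get(conf);
        FileSystem localFs = FileSystem.getLocal(conf);

        // Re-fetch only if the local copy is missing or older than the master copy.
        boolean stale = !localFs.exists(local)
            || localFs.getFileStatus(local).getModificationTime()
               < hdfs.getFileStatus(master).getModificationTime();
        if (stale) {
          hdfs.copyToLocalFile(master, local);
        }
        return local;
      }
    }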
Clydesdale is focused on processing queries over star schema data sets. It uses several techniques like columnar storage, block iteration, multi-core aware operators, and a multi-way map-side join plan. This combination of techniques adds up to a massive advantage over Hive, especially for the star schema benchmark. The figure below shows the execution time for each of the queries in the benchmark using 1) Clydesdale, 2) the repartition join strategy in Hive, and 3) the map-join strategy in Hive. Depending on the query, Clydesdale is 5x to 83x faster than the best Hive plan, averaging about 38x across the benchmark. These are numbers on a 9-node cluster running the star-schema benchmark at scale factor 1000. The general trend holds for larger clusters and larger datasets. 
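Of the techniques listed above, block iteration is the easiest to illustrate. Operators work on a block of rows at a time, with each column's values held in plain arrays, so the inner loop stays tight instead of paying per-row interpretation overhead. The fragment below is only meant to convey the flavor -- the column names and the discount predicate are made up, loosely in the style of the star schema benchmark, and this is not Clydesdale's actual operator code.

    // Block iteration over columnar data: one tight loop per block, no per-row dispatch.
    public final class BlockScan {
      // Sum extendedprice * discount over the rows of one block that pass the predicate.
      static long sumRevenue(int[] discount, long[] extendedPrice, int rowsInBlock) {
        long revenue = 0;
        for (int i = 0; i < rowsInBlock; i++) {
          if (discount[i] >= 1 && discount[i] <= 3) {   // made-up predicate
            revenue += extendedPrice[i] * discount[i];
          }
        }
        return revenue;
      }
    }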
To be fair, Hive is general purpose and not really focused on doing well on star-schema workloads. The comparison is not intended as a criticism of Hive, but as evidence of how much performance potential the Hadoop platform holds for structured data processing! While Hadoop was originally designed for large, compute-intensive parallel tasks (e.g., analyzing web pages and building search indexes), the platform turns out to be quite capable at structured data processing as well. The ideas in Clydesdale can be used to dramatically improve the performance of Hive and similar systems.

What does this mean for the larger Hadoop ecosystem? This is very good news for BI (Business Intelligence) vendors on Hadoop like Pentaho, Datameer, Platfora, and others. If you rely on SQL-like processing to deliver business insights, and the data lies in Hadoop, we can now do a whole lot better than today's Hive performance!

Tuesday, January 10, 2012

Clydesdale: SQL on Hadoop

Clydesdale is an experimental system we built at Almaden for processing structured data on Hadoop. It is obvious that structured data processing is becoming an increasingly important workload on Hadoop clusters. While Hive is great at letting people use SQL to process large amounts of structured data on Hadoop, it is surprisingly inefficient. Researchers have recently argued that there is more than an order of magnitude inefficiency for relational-like workloads on Hive (see this paper). Clydesdale is a new way of tackling this problem. In a paper that was recently accepted to EDBT 2012, we showed that Clydesdale is 38x faster than Hive on the popular Star-Schema Benchmark.

Clydesdale is aimed at workloads where the data fits a star schema. A substantial fraction of structured data repositories follow either a star schema or a snowflake schema. Clydesdale is specialized to perform well on these, but it can handle arbitrary schemas. We draw on many well-known techniques from the parallel DBMS world, like column-oriented storage, tailored join plans, and multi-core execution strategies, to achieve dramatic performance improvements over existing approaches. The best part is that Clydesdale works on a stock Hadoop installation -- no modifications are needed to the underlying implementation! Using the star schema benchmark, we show that Clydesdale is 5x to 89x faster for different queries, averaging a 38x advantage. I'll post a link to the paper with all the details as soon as I have the camera-ready version finalized for EDBT.
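As a taste of what "multi-core execution strategies" means here, below is a toy Java sketch of a single task splitting a block of column values across a small thread pool and combining per-thread partial sums. The thread-pool setup and the aggregate are invented for illustration; Clydesdale's operators are considerably more involved.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Toy sketch: scan a block of column values with several threads and merge the partial sums.
    public class ParallelBlockScan {
      public static long parallelSum(long[] measure, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (measure.length + threads - 1) / threads;
        List<Future<Long>> partials = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
          final int lo = t * chunk;
          final int hi = Math.min(measure.length, lo + chunk);
          partials.add(pool.submit(() -> {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += measure[i];  // each thread scans its own range
            return sum;
          }));
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get();      // combine partial sums
        pool.shutdown();
        return total;
      }
    }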

Here's a high-level overview of the architecture and the query processing strategy. The fact table is stored on HDFS using CIF, a columnar storage approach described in our earlier VLDB paper. (I've also previously discussed CIF here and here.) A master copy of the dimension tables is also stored in HDFS; additionally, they are replicated to the local storage of each node in the cluster. SQL queries submitted to Clydesdale turn into MapReduce jobs. Join processing is done in the map phase: in the setup phase, mappers apply predicates to the dimension tables and build hash tables; they then join the rows of the fact table by probing those hash tables and emitting the rows that satisfy the join. The reduce phase is responsible for grouping and aggregation. Clydesdale draws on several strategies for performance: careful use of multi-core parallelism, a tailored star-join plan instead of joining tables two at a time, columnar storage, and -- to a limited extent -- columnar execution. No single technique is responsible for the overall improvement -- each of them contributes to the 38x overall speedup we measured over Hive.
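Here's a heavily stripped-down sketch of that map-side join plan, written against the standard Hadoop MapReduce API. To keep it short, it assumes a single dimension table shipped to each node as a local file named dim_customer, comma-separated text for both tables, an invented dimension predicate, and a SUM aggregate grouped by one dimension attribute; the real system handles multiple dimension tables in one pass and reads the fact table from CIF rather than text.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StarJoinSketch {

      public static class StarJoinMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Map<String, String> customerRegion = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
          // Setup phase: build a hash table from the locally cached dimension table,
          // applying the dimension predicate as we go.
          try (BufferedReader in = new BufferedReader(new FileReader("dim_customer"))) {
            String line;
            while ((line = in.readLine()) != null) {
              String[] f = line.split(",");          // f[0]=custkey, f[1]=region, f[2]=segment (made up)
              if ("AUTOMOBILE".equals(f[2])) {       // made-up dimension predicate
                customerRegion.put(f[0], f[1]);
              }
            }
          }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] f = value.toString().split(",");  // fact row: f[0]=custkey, f[1]=revenue (made up)
          String region = customerRegion.get(f[0]);  // probe: does this fact row join a qualifying customer?
          if (region != null) {
            context.write(new Text(region), new LongWritable(Long.parseLong(f[1])));
          }
        }
      }

      public static class AggregationReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();   // grouping and aggregation in the reduce phase
          context.write(key, new LongWritable(sum));
        }
      }
    }

Wiring this into a job is the usual driver boilerplate: set the mapper and reducer classes, point the input at the fact table on HDFS, and ship the dimension files to each node via the distributed cache.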

Clydesdale's central contribution is a demonstration that a simple architecture that leverages existing techniques from parallel DBMSs can provide substantial benefits for structured data processing on MapReduce. The experimental results demonstrate that for certain structured data processing workloads, it is not necessary to resort to more radical approaches like discarding HDFS and embedding a relational database (as in HadoopDB), or requiring unusual storage organization and join processing techniques (as in Llama). Clydesdale layers on an unmodified Hadoop installation and provides major performance improvements for structured data processing.