Monday, March 12, 2012

Clydesdale Updates

The Hadoop Summit is coming up in June in San Jose. Every year, this event gets bigger and more interesting.
I've submitted an abstract to talk about Clydesdale and structured data processing on Hadoop in general. You can vote for it here: https://hadoopsummit2012.uservoice.com/forums/151413-track-1-future-of-apache-hadoop/suggestions/2663590-structured-data-processing-on-hadoop

In other news, our demonstration proposal for Clydesdale got accepted at SIGMOD. If you're going to be at the conference in Arizona in May this year, drop by for a chat and for the latest updates on what we have cooking in this space.

Wednesday, February 8, 2012

Clydesdale Overview

Here's a quick overview of Clydesdale's architecture and some details on its performance. I'll link to the camera-ready version of the EDBT paper soon. The big-picture idea in Clydesdale is the same as in Hive: a SQL-like interface for processing structured data on Hadoop. The target space where this is likely to be most useful is large investigative marts. (See my older post and this post from Monash for background.)

Clients submit a SQL query to Clydesdale, and the compiler turns it into one or more MapReduce jobs. (There are ways to access Clydesdale using more programmatic APIs, but that's for another post :).) The fact table is on HDFS, stored using CIF. A master copy of the dimension tables is also kept in HDFS, and the dimension tables are additionally cached on the local storage of each node. The picture below shows the general architecture -- Clydesdale layers neatly on top of unmodified Hadoop.
Clydesdale is focused on processing queries over star schema data sets. It uses several techniques like columnar storage, block iteration, multi-core aware operators, and a multi-way map-side join plan. This combination of techniques adds up to a massive advantage over Hive, especially for the star schema benchmark. The figure below shows the execution time for each of the queries in the benchmark using 1) Clydesdale, 2) the repartition join strategy in Hive, and 3) the map-join strategy in Hive. Depending on the query, Clydesdale is 5x to 83x faster than the best Hive plan, averaging about 38x across the benchmark. These are numbers on a 9-node cluster running the star-schema benchmark at scale factor 1000. The general trend holds for larger clusters and larger datasets. 
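To make one of those techniques, block iteration, a little more concrete, here is a minimal hypothetical sketch (not Clydesdale's actual code; the class and method names are made up) of the difference between tuple-at-a-time evaluation and evaluating a predicate over a block of column values pulled from columnar storage.

```java
// Hypothetical illustration of block iteration over a column of values.
// Instead of interpreting one row at a time through a generic tuple
// interface, the operator works on a dense primitive array ("block")
// read from columnar storage, which keeps the inner loop tight and
// friendly to the JIT and the CPU cache.
public class BlockIterationSketch {

    // Tuple-at-a-time style: one Object[] row and one unboxing per row.
    static long sumIfLessThanRowAtATime(Iterable<Object[]> rows,
                                        int priceCol, int threshold) {
        long sum = 0;
        for (Object[] row : rows) {
            int price = (Integer) row[priceCol];   // unboxing per row
            if (price < threshold) {
                sum += price;
            }
        }
        return sum;
    }

    // Block-iteration style: the column arrives as a plain int[] block.
    static long sumIfLessThanBlockAtATime(int[] priceBlock, int threshold) {
        long sum = 0;
        for (int price : priceBlock) {             // tight loop over primitives
            if (price < threshold) {
                sum += price;
            }
        }
        return sum;
    }
}
```

The second loop is the kind of code shape that a columnar layout like CIF makes possible, and it is one ingredient in the gap over a row-at-a-time interpreter.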
To be fair, Hive is general-purpose and not really focused on doing well on star-schema workloads. The comparison is not intended as a criticism of Hive, but as evidence of how much performance potential is available on the Hadoop platform for structured data processing! While Hadoop was originally designed for large, compute-intensive parallel tasks (e.g., analyzing web pages and building search indexes), the platform holds up well for structured data processing too. The ideas in Clydesdale can be used to dramatically improve the performance of Hive and similar systems.

What does this mean for the larger Hadoop ecosystem? This is very good news for BI (Business Intelligence) vendors on Hadoop like Pentaho, Datameer, Platfora, and others. If you rely on SQL-like processing to deliver business insights, and the data lies in Hadoop, we can now do a whole lot better than today's Hive performance!

Tuesday, January 10, 2012

Clydesdale: SQL on Hadoop

Clydesdale is an experimental system we built at Almaden for processing structured data on Hadoop. It is obvious that structured data processing is becoming an increasingly important workload on Hadoop clusters. While Hive is great at letting people use SQL to process large amounts of structured data on Hadoop, it is surprisingly inefficient. Researchers have recently argued that there is more than an order of magnitude inefficiency for relational-like workloads on Hive (see this paper). Clydesdale is a new way of tackling this problem. In a paper that was recently accepted to EDBT 2012, we showed that Clydesdale is 38x faster than Hive on the popular Star-Schema Benchmark.

Clydesdale is aimed at workloads where the data fits a star schema. A substantial fraction of structured data repositories follow either a star schema or a snowflake schema. Clydesdale is specialized to perform well for these, but it can also handle arbitrary schemas. We draw on many well-known techniques from the parallel DBMS world, like column-oriented storage, tailored join plans, and multi-core execution strategies, to achieve dramatic performance improvements over existing approaches. The best part is that Clydesdale works on a stock Hadoop installation -- no modifications are needed to the underlying implementation! Using the star schema benchmark, we show that Clydesdale is 5x to 89x faster for different queries, averaging a 38x advantage. I'll post a link to the paper with all the details as soon as I have the camera-ready version finalized for EDBT.

Here's a high-level overview of the architecture and the query processing strategy: The fact table is stored on HDFS using CIF, a columnar storage approach described in our earlier VLDB paper. (I've also previously discussed CIF here and here.) A master copy of the dimension tables is also stored in HDFS; additionally, they are replicated to the local storage of each node in the cluster. SQL queries submitted to Clydesdale turn into MapReduce jobs. Join processing is done in the map phase: mappers apply predicates to the dimension tables and build hash tables in the setup phase; then they join the rows of the fact table by probing the hash tables and emitting the rows that satisfy the join. The reduce phase is responsible for grouping and aggregation. Clydesdale draws on several strategies for performance improvement: careful use of multi-core parallelism, a tailored star-join plan instead of joining tables two at a time, columnar storage, and -- to a limited extent -- columnar execution. None of these techniques is singularly responsible for the overall improvement -- each of them contributes to the 38x overall speedup we measured over Hive.
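Here is a minimal sketch of what that map-side star-join pattern looks like as a Hadoop mapper. This is not Clydesdale's actual code: the dimension contents, the delimited row format, and the column choices are made up for illustration, and the columnar CIF input is simplified to a plain text fact table.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of a map-side star join: the dimension tables are
// small enough to hash in memory, so each mapper builds the hash tables
// once in setup() and then streams over its split of the fact table,
// probing the hash tables and emitting (group-by key, measure) pairs for
// the reducers to aggregate.
public class StarJoinMapperSketch
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    // dimension key -> the dimension attribute we group by (post-predicate)
    private final Map<String, String> dateDim = new HashMap<String, String>();
    private final Map<String, String> custDim = new HashMap<String, String>();

    @Override
    protected void setup(Context context) {
        // In a real system these would be loaded from the local
        // dimension-table cache (or HDFS), applying the query's predicates
        // while loading. Hard-coded here to keep the sketch self-contained.
        dateDim.put("19970101", "1997");
        custDim.put("C42", "ASIA");
    }

    @Override
    protected void map(LongWritable offset, Text factRow, Context context)
            throws IOException, InterruptedException {
        // Assume a delimited fact row: dateKey|custKey|revenue
        String[] f = factRow.toString().split("\\|");
        String year = dateDim.get(f[0]);
        String region = custDim.get(f[1]);
        if (year == null || region == null) {
            return; // fact row fails the join or the dimension predicates
        }
        long revenue = Long.parseLong(f[2]);
        // The group-by key goes to the reducer, which sums the measure.
        context.write(new Text(year + "|" + region), new LongWritable(revenue));
    }
}
```

The reduce side then just sums the measure per group-by key, which is why essentially all of the join cost is paid in the map phase.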

Clydesdale's central contribution is a demonstration that a simple architecture that leverages existing techniques from parallel DBMSs can provide substantial benefits for structured data processing on MapReduce. The experimental results demonstrate that for certain structured data processing workloads, it is not necessary to resort to more radical approaches like discarding HDFS and embedding a relational database (such as in HadoopDB); or requiring unusual storage organization and join processing techniques (such as in Llama). Clydesdale layers on an unmodified Hadoop installation and provides major performance improvements for structured data processing.

Thursday, November 17, 2011

Analytic DBMS Landscape

Curt Monash has a great post describing eight kinds of analytic DBMSes. Here's a picture that presents a simplified view, but I highly recommend the original post. There are three major groups: the big enterprise data warehouses (EDWs), various kinds of data marts, and what he calls the "bit bucket" category -- which is where he places Hadoop. I've left out the "archival store" and the "operational analytic server," the latter being essentially a catch-all category.

Let's look at Hadoop and related technologies -- Monash argues that this is the "big bit bucket" category: a place to cheaply gather large amounts of data and do some analytics on it. This is the kind of data you'd normally consider too low-value to justify putting inside your gold-plated EDW. As Hadoop improves on TCO and performance and adds better support for reporting, high-level analytics, and workload management, how is it going to affect the other spaces? I've argued before that Hadoop is not going to be the next platform for EDWs for a long time to come.

One way to look at an improved Hadoop is as a bigger, more reliable, lower-cost player in the "large investigative data mart" space. Investigative analytics, along with low cost, is likely to be Hadoop's major advantage: cheaper storage, better programmability, great fault tolerance, and eventually reasonably good efficiency are going to make for a very interesting entry in that space.

Traditional data marts are really good at being data marts! Hadoop is a long way from competing with those -- but there is probably a segment of this market that values elasticity and low cost. My guess is that startups like Platfora are going to target this space. I suspect that very-high-scale marts without demanding concurrency and performance requirements will soon be better served by Hadoop-based solutions.

The unlikely-sounding category of "outsourced data marts" is actually quite promising. As workload management improves, it might be possible to consolidate large outsourced marts for reporting on Hadoop. Individual marts larger than 20PB are unlikely to be a large enough market (at least, not in the short term), but businesses that offer analytics on large datasets -- advertising tracking, clickstream analytics for customer modeling, website optimization, etc. -- will probably want a highly scalable, elastic, multi-tenant, low-cost mart.

Tuesday, November 15, 2011

Big Data and Hardware Trends

One of the common refrains from "real" data warehouse folks when it comes to data processing on Hadoop is the assumption that a system built in Java is going to be extremely inefficient for many tasks. This is quite true today. However, for managing large datasets, I wonder if that's the right way to look at it. CPU cycles available per dollar are increasing more rapidly than sequential disk bandwidth available per dollar. Dave Patterson observed this a long time ago: he noted that the annual improvement rate for CPU cycles is about 1.5x, while the improvement rate for disk bandwidth is about 1.3x ... and there seems to be evidence that the disk bandwidth improvements are tapering off.

While SSDs remain expensive for sheer capacity, disks will continue to store most of the petabytes of "big data". A standard 2U server can only hold so many disks: you max out at about 16 2.5" drives or about 8-10 3.5" drives, which gives us about 1500MB/s of aggregate sequential bandwidth. You can stuff 48 cores into a 2U server today, and that number will probably be around 200 cores in 3-4 years. Even with improvements in disk bandwidth, I'd be surprised if we could get more than 2000MB/s out of disks in 3-4 years. In the meantime, JVMs will likely close the gap in CPU efficiency. In other words, the penalty for being in Java, and therefore CPU-inefficient, is likely to get progressively smaller.
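As a quick back-of-the-envelope check on that argument, here is a small sketch using only the rough figures quoted above (the numbers are the post's estimates, not measurements; the 4-year horizon is an assumption):

```java
// Back-of-the-envelope arithmetic for the numbers in the post above.
// All inputs are rough estimates quoted in the post, not measurements.
public class BandwidthPerCore {
    public static void main(String[] args) {
        // Today: ~1500MB/s of aggregate sequential disk bandwidth, ~48 cores per 2U box.
        double bwTodayMBps = 1500.0;
        int coresToday = 48;

        // In 3-4 years (per the post's guess): ~2000MB/s, ~200 cores.
        double bwFutureMBps = 2000.0;
        int coresFuture = 200;

        System.out.printf("MB/s per core today:  %.1f%n",
                bwTodayMBps / coresToday);      // ~31.3 MB/s per core
        System.out.printf("MB/s per core future: %.1f%n",
                bwFutureMBps / coresFuture);    // ~10.0 MB/s per core

        // Patterson-style growth rates: ~1.5x/year for CPU cycles vs
        // ~1.3x/year for disk bandwidth. Over 4 years the gap compounds.
        double years = 4;
        System.out.printf("CPU vs disk growth gap over %.0f years: %.1fx%n",
                years, Math.pow(1.5, years) / Math.pow(1.3, years)); // ~1.8x
    }
}
```

If each core only has to chew through roughly 10MB/s of input, a somewhat CPU-inefficient Java engine has a lot of headroom before it, rather than the disks, becomes the bottleneck -- which is the crux of the argument above.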

There are lots of assumptions in the arguments above, but if they line up, Hadoop-based data management solutions might start to become competitive with current SQL engines for simple workloads on very large datasets. The Tenzing folks claim that this is already the case with a C/C++-based MapReduce engine.

Tuesday, October 4, 2011

Tenzing: SQL on MapReduce

Google described Tenzing, their implementation of SQL on MapReduce, at VLDB this year. The paper explains that they built it because their data warehouse was getting too expensive and was not able to keep pace with the demands users made -- in terms of complex queries (SQL was always getting in the way) and performance (the ETL process was often a big bottleneck). Neither argument is particularly surprising, although there are examples of successful PB-sized warehouses in the industry.

One of the most interesting parts of the paper was a laundry list of things they had to fix in MapReduce to get better performance for SQL:
  • Worker pools: These are essentially long-running processes that handle the various parts of a MapReduce job (map tasks, reduce tasks, the job coordinator, etc.). Having a pool of already-running processes keeps latencies lower than launching a binary for each task in the job -- which is certainly the case with JVM launches in Hadoop. Hadoop gets part of the way there with reusable JVMs (see the sketch after this list). The tradeoff, of course, is that fault isolation becomes a messier proposition.
  • Streaming and In-Memory Chaining: Allows two MapReduce jobs to communicate without writing temporary data to disk (GFS). I wonder if this can be done easily with just some InputFormat/OutputFormat magic ... I suspect it's doable with some thought. Memory chaining allows a mapper and a reducer to be co-located in the same process; this is probably going to be a bit harder to do in Hadoop.
  • Sort Avoidance: This feature lets you tell MapReduce to shuffle but not sort. I've seen the need for this in many applications, and it makes perfect sense for Hadoop as well.
  • Block Shuffle: For smaller rows, when sorting is not needed, the block shuffle reduces overheads in the shuffle phase. This is a performance opportunity opened up by sort avoidance.
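On the worker-pool point: the closest knob Hadoop has today (in the 1.x-era mapred API) is JVM reuse, which lets a TaskTracker run multiple tasks of the same job in one JVM instead of forking a fresh one per task. Here's a minimal sketch; the identity mapper/reducer and the job name are there only to keep it self-contained.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Minimal sketch of Hadoop's JVM reuse knob (old mapred API): a TaskTracker
// runs many tasks of this job in the same JVM instead of paying a JVM
// launch per task.
public class JvmReuseSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JvmReuseSketch.class);
        conf.setJobName("jvm-reuse-sketch");

        // -1 means "no limit": reuse the JVM for as many of this job's tasks
        // as get scheduled on a node (mapred.job.reuse.jvm.num.tasks = -1).
        conf.setNumTasksToExecutePerJvm(-1);

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

Note that this reuse is scoped to a single job and the JVM is not shared across jobs, so it is a much weaker mechanism than Tenzing's persistent worker pools -- and, as noted above, even this much reuse weakens fault isolation.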
The other interesting bit was the section on their execution engine. Their SQL-Sawzall implementation was slow because of all the deserialization and serialization costs associated with moving data in and out of Sawzall's type system. The second version used Dremel's SQL expression evaluation engine -- this was better than the Sawzall engine, but still had inefficiencies. The most promising version was the LLVM-based experimental engine that could operate natively on columnar data, and this, not surprisingly, provided the best performance.

What does this mean for the Apache Hadoop ecosystem? I'm guessing the biggest performance opportunities for Hive currently lie in making the execution engine more CPU-efficient. In fact, an experimental branch of Hive that does this would probably be a really fun open-source project right now. Hadoop will likely incorporate some of the performance improvements described in Tenzing over time, and Hive should be able to ride those when they become available.