
Wednesday, February 8, 2012

Clydesdale Overview

Here's a quick preview of Clydesdale's architecture and some details on its performance. I'll link to the camera-ready version of the EDBT paper soon. The big-picture idea in Clydesdale is the same as in Hive: a SQL-like interface for processing structured data on Hadoop. The target space where this is likely to be most useful is large investigative marts. (See my older post and this post from Monash for background.)

Clients submit a SQL query to Clydesdale, and the compiler turns it into one or more MapReduce jobs. (There are ways to access Clydesdale using more programmatic APIs, but that's for another post :) ) The fact table lives on HDFS, stored using CIF. A master copy of the dimension tables is also kept in HDFS, and the dimension tables are cached on the local storage of each node. The picture below shows the general architecture -- Clydesdale layers neatly on top of unmodified Hadoop.
Clydesdale is focused on processing queries over star-schema data sets. It uses several techniques such as columnar storage, block iteration, multi-core-aware operators, and a multi-way map-side join plan (sketched below). This combination of techniques adds up to a massive advantage over Hive, especially on the star-schema benchmark. The figure below shows the execution time for each query in the benchmark using 1) Clydesdale, 2) the repartition join strategy in Hive, and 3) the map-join strategy in Hive. Depending on the query, Clydesdale is 5x to 83x faster than the best Hive plan, averaging about 38x across the benchmark. These numbers are from a 9-node cluster running the benchmark at scale factor 1000, and the general trend holds for larger clusters and larger datasets.
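To make the map-side join idea concrete, here's a minimal sketch of what a star-join mapper might look like on plain Hadoop: the small dimension tables are loaded into in-memory hash maps once per task (from locally cached copies, e.g. via the distributed cache), and each fact row is joined against them locally, so the join itself needs no shuffle. This is my own illustration of the general technique, not Clydesdale's actual code; the class, file, and column names are hypothetical.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class StarJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> dateDim = new HashMap<>();
      private final Map<String, String> customerDim = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Load each locally cached dimension table into a hash map keyed on its
        // primary key. This happens once per map task, not once per record.
        loadDimension("date_dim.txt", dateDim);
        loadDimension("customer_dim.txt", customerDim);
      }

      @Override
      protected void map(LongWritable offset, Text factRow, Context context)
          throws IOException, InterruptedException {
        // Hypothetical pipe-delimited fact layout:
        // cols[0] = date key, cols[1] = customer key, cols[2] = revenue.
        String[] cols = factRow.toString().split("\\|");
        String date = dateDim.get(cols[0]);
        String customer = customerDim.get(cols[1]);
        if (date != null && customer != null) {           // join predicates satisfied
          context.write(new Text(date + "," + customer),  // grouping key for aggregation
                        new Text(cols[2]));               // measure to aggregate downstream
        }
      }

      private void loadDimension(String file, Map<String, String> dim) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] cols = line.split("\\|");
            dim.put(cols[0], cols[1]);  // key -> the attribute the query needs
          }
        }
      }
    }

Because all the dimension lookups happen inside the mapper, a multi-way star join collapses into a single map-only pass over the fact table, which is exactly where the plan saves over Hive's repartition join.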
To be fair, Hive is general purpose and not really focused on doing well on star-schema workloads. The comparison is not intended as a criticism of Hive, but as evidence of how much performance potential is available on the Hadoop platform for structured data processing! While Hadoop was originally designed for large, more compute-intensive parallel tasks (e.g., analyzing web pages and building search indexes), the platform is not so bad when it comes to structured data processing. The ideas in Clydesdale can be used to tremendously improve the performance of Hive and similar systems.

What does this mean for the larger Hadoop ecosystem? This is very good news for BI (Business Intelligence) vendors on Hadoop like Pentaho, Datameer, Platfora, and others. If you rely on SQL-like processing to deliver business insights, and the data lies in Hadoop, we can now do a whole lot better than today's Hive performance!

Tuesday, November 15, 2011

Big Data and Hardware Trends

One of the common refrains from "real" data warehouse folks when it comes to data processing on Hadoop is the assumption that a system built in Java is going to be extremely inefficient for many tasks. That is quite true today. However, for managing large datasets, I wonder if that's the right way to look at it. CPU cycles available per dollar are increasing more rapidly than sequential disk bandwidth available per dollar. Dave Patterson observed this a long time ago: the annual improvement rate for CPU cycles is about 1.5x, while the improvement rate for disk bandwidth is about 1.3x ... and there seems to be evidence that the disk bandwidth improvements are tapering off.

While SSDs remain expensive for sheer capacity, disks will continue to store most of the petabytes of "big data". A standard 2U server can only hold so many disks ... you max out at about 16 2.5" disks or about 8-10 3.5" disks. At roughly 100MB/s of sequential bandwidth per disk, that gives us about 1500MB/s of aggregate sequential bandwidth. You can stuff 48 cores into a 2U server today, and that number will probably be around 200 cores in 3-4 years. Even with improvements in disk bandwidth, I'd be surprised if we could get more than 2000MB/s out of the disks in 3-4 years. In the meantime, JVMs will likely close the gap in CPU efficiency. In other words, the penalty for being in Java, and therefore CPU inefficient, is likely to get progressively smaller.
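Putting those numbers side by side (using the figures above) makes the trend clear: the amount of sequential scan bandwidth each core has to keep up with is shrinking.

    Today:        1500 MB/s / 48 cores  ≈ 31 MB/s of scan bandwidth per core
    In 3-4 years: 2000 MB/s / 200 cores = 10 MB/s of scan bandwidth per core

If each core only needs to process on the order of 10MB/s to keep the disks busy, even a moderately CPU-inefficient runtime can stay I/O-bound on scan-heavy workloads.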

There are lots of assumptions in the arguments above, but if they line up, it means that Hadoop-based data management solutions might start to get competitive with current SQL engines for simple workloads on very large datasets. The Tenzing folks claim that this is already the case with a C/C++ based MapReduce engine.

Tuesday, October 4, 2011

Tenzing: SQL on MapReduce

Google described Tenzing, their implementation of SQL on MapReduce, at VLDB this year. The paper explains that they built it because their data warehouse was getting too expensive and could not keep pace with the demands users made -- in terms of complex queries (SQL was always getting in the way) and performance (the ETL process was often a big bottleneck). Neither argument is particularly surprising, although there are examples of successful PB-sized warehouses in the industry.

One of the most interesting parts of the paper was a laundry list of things they had to fix in MapReduce to get better performance for SQL:
  • Worker pools: These are essentially long-running processes that do various parts of the MapReduce job (map tasks, reduce tasks, job coordination, etc.). Having a pool of running processes makes latencies lower than if you had to launch a binary for each task in the job. This is certainly the case with JVM launches in Hadoop; Hadoop gets part of the way there with reusable JVMs (see the configuration sketch after this list). The tradeoff, of course, is that fault isolation becomes a messier proposition.
  • Streaming and In-Memory Chaining: Streaming allows two MapReduce jobs to communicate without writing temporary data to disk (GFS). I wonder if this can be done easily with just some InputFormat/OutputFormat magic ... I suspect it is doable with some thought. Memory chaining allows a mapper and a reducer to be co-located in the same process. This is probably going to be a bit harder to do in Hadoop.
  • Sort Avoidance: This feature allowed you to tell MapReduce to shuffle, but not sort. I've seen the need for this in many applications, and it makes perfect sense for Hadoop as well.
  • Block Shuffle: For smaller rows, when sorting is not needed, the block shuffle reduces overheads in the shuffle phase. This is a performance opportunity opened up by sort avoidance.
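On the worker-pool point above, Hadoop's closest existing knob is JVM reuse. Here's a minimal sketch using the Hadoop 1.x mapred API; it amortizes the JVM launch cost across the tasks of a job, which is only part of what a true worker pool gives you.

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
      public static void main(String[] args) {
        // Sketch of job setup: ask Hadoop to reuse each task JVM for any number
        // of tasks of this job. The default of 1 launches a fresh JVM per task.
        JobConf conf = new JobConf(JvmReuseExample.class);
        conf.setNumTasksToExecutePerJvm(-1);   // -1 = no limit on reuse
        // Equivalent configuration property: mapred.job.reuse.jvm.num.tasks = -1
        // The tradeoff, as noted above, is weaker fault isolation between tasks.
      }
    }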
The other interesting bit was the section on their execution engine. Their SQL-Sawzall implementation was slow because of all the deserialization and serialization costs associated with moving data in and out of Sawzall's type system. The second version used Dremel's SQL expression evaluation engine -- this was better than the Sawzall engine, but still had inefficiencies. The most promising version was the LLVM-based experimental engine that could operate natively on columnar data, and this, not surprisingly, provided the best performance.
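To see why the columnar, native-execution version wins, consider a toy example of my own (not Tenzing's code): evaluating a filter-and-sum over a column stored as a primitive array versus over deserialized row objects. The tight loop over the primitive column avoids per-row allocation, unboxing, and GC pressure -- exactly the kind of overhead a serialization-heavy engine pays on every row.

    import java.util.List;

    public class ColumnScanDemo {

      // Row-at-a-time over deserialized rows: each value is boxed inside a row
      // object, so the scan pays object overhead and unboxing for every row.
      static long sumIfRowAtATime(List<Object[]> rows, long threshold) {
        long sum = 0;
        for (Object[] row : rows) {
          long revenue = (Long) row[2];   // unbox on every row
          if (revenue > threshold) sum += revenue;
        }
        return sum;
      }

      // Columnar: the revenue column is a primitive array, so the predicate and
      // aggregate run in a tight loop the JIT can optimize (no boxing, no objects).
      static long sumIfColumnar(long[] revenue, long threshold) {
        long sum = 0;
        for (long v : revenue) {
          if (v > threshold) sum += v;
        }
        return sum;
      }
    }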

What does this mean for the Apache Hadoop ecosystem? I'm guessing the biggest performance opportunities for Hive currently lie in making the execution engine more CPU efficient. In fact, an experimental branch of Hive that does this would probably be a really fun open-source project right now. Hadoop will likely incorporate some of the performance improvements described in Tenzing over time, and Hive should be able to ride those when they become available.