Sandeep's Research Notes: July 2011

Saturday, July 30, 2011

Hadoop: A Disruptive Technology?

This is going to be a rather non-technical post ... you have been warned :-)

Clayton Christensen's thesis (Innovator's Dilemma) more than a decade ago was that rational management practices of listening to your customers, and building and investing in what solves your customer's problems can lead you astray in the context of disruptive technologies. I've heard the phrase "Hadoop is a disruptive technology" in various contexts. I wondered ... does it really hold true according to Christensen's classic definition?

For a technology to be "disruptive", it should satisfy three requirements

It is a simplifying technology -- cheaper/easier than established technologies
Commercialized first in emerging or insignificant markets
The market leader's biggest customers generally don't want to and initially can't use it

Simplifying Technology

Assuming that "big warehouse" vendors are the established leaders in the market, I'd guess that this is true. Big warehouses are messy and complicated to install, configure, tune, and get running. Hadoop is certainly simpler to install. Warehouses are usually deployed with non-commodity storage/networking. Hadoop runs on cheap commodity boxes. Is it simpler to use? That could go either way. If your problem fits in SQL, relational warehouses do an amazing job. Writing MapReduce programs is a good bit more work than writing a SQL query. Oh, and Hadoop is definitely cheaper :-) I'd give this a +0.75

Emerging Markets/ Value Networks

This is an easy +1. No one is clamoring to replace their big enterprise warehouse or even their big data marts with Hadoop. It is not clear how big the market is for Hadoop and related technologies. It isn't entirely clear what the market is. The early users have been big web companies that need to analyze web pages, search logs, and click logs. None of this data typically goes into a big relational warehouse. Monash has been arguing that the predominant driver for Hadoop has been the "big bit bucket" use case. Anant Jhingran alluded to many emerging use cases in enterprises -- most of these fall under the kinds of analytics that are/were not commonly done in traditional warehouses. The places where Hadoop is getting interest is not from the sort of enterprises/departments/groups who have typically bought warehouses to solve their data analysis needs.

Unattractive to Market Leaders and their Customers

Traditional enterprises don't and can't use Hadoop as their central data warehouse. Not today, and perhaps not in the near term either. The big central EDW is typically way more complicated than a typical Hadoop cluster - there is a complex ETL system feeding the beast, there are a bunch of tools producing and sending out reports, and there may be other ETL-off ramps feeding smaller data marts. Big EDW customers use complex SQL, many reporting tools that work with the warehouse, advanced features like views, triggers, and often even transactional updates. Big enterprises looking for an EDW solution simply cannot use Hadoop today. As a result, this is not the most attractive technology for big warehouse customers. Again, a +1.

That's 2.75/3. Neat. All this talk of Hadoop being a "disruptive technology" is not hot air after all. It does fit Christensen's definition rather well!

Sunday, July 24, 2011

RCFile and CIF

The benefits of using columnar storage for relational data is well known. The "Column-store for Hadoop" paper I recently blogged about focuses on the advantages of using CIF for complex variable-length data types like maps, arrays, nested records etc. Not surprisingly, it turns out that even if the data is flat and well-structured, CIF has two advantages over RCFile:

When you read a dataset laid out in CIF, you end up touching fewer HDFS blocks than you would with RCFile.
It is possible to add derived columns in a way that they are co-located correctly with the existing data. With RCFile, as it stands, the only option is to re-load the entire dataset.

Let me explain the first point with some numbers. Assume you're reading a dataset with three columns: a 4-byte integer (X), a 100 byte string (Y), and an 8-byte float (Z). Further, assume that you have 100 million rows. That gives us 100M * (112) = 11200Mi bytes ~ 11.2 GB. Assuming 64MB blocks, the data will be laid out over ~175 blocks when using RCFile. I'm completely ignoring the metadata overheads.

Suppose that you were only reading X, ideally, you'd want to read 4 * 100M ~ 400MB of data. With RCFile, you'd open each of the 175 blocks, read a small portion of the data there (~2.28MB) for X, then move on to the next block. With the CIF layout where you put each column in a separate file, you'd only read
~7 HDFS blocks to read all the X values. If you were only scanning Y, this would be 157 blocks, and for Z, 13 blocks. If the actual processing you were doing on the data were simple, such as just adding up the numbers -- which is typically the case for relational workloads, the per-block overheads (opening the block, interpreting the RCFile headers, seeking to the right place, etc.) become very substantial very quickly. As a result, CIF starts to have a major advantage over RCFile for wide rows where only a small number of columns are being scanned.

The second advantage is easier to understand. Given that HDFS is an append-only file system, adding derived columns while guaranteeing locality for the three replicas is very difficult with RCFile. Consider that you wanted to add a new column T to the dataset above that was computed by analyzing the string Y. An example would be some kind of feature extractor applied on a URL field. Now, this field T needs to be stored on the same nodes that the (X,Y,Z) columns for the row are stored on. The only way to guarantee this with the current design of RCFile is to re-write the file with (X,Y,Z,T) so that all the columns are organized and laid out correctly in the same HDFS block.

With CIF, the above is easily accomplished by adding the T data in a separate file and placing it next to the X,Y,Z files in each of the split directories of the dataset (Section 4.2 in the paper explains split-directories.) The RCFile design will need to be changed significantly using the ideas in CIF to accommodate such a feature.

The other side: RCFile advantages

The biggest advantage of RCFile is its simplicity. It fits beautifully within HDFS's constraints. CIF requires changing the HDFS placement policy (possible Hadoop-0.21.0 onwards) and also requires tracking a bunch of additional metadata. It also ends up creating a larger number of files, which could cause additional pressure on the namenode. However, for many deployments this should not be a major issue. There's a detailed discussion of this in the paper.

Friday, July 15, 2011

Spinnaker, Paxos, and Gaios

Here's a great paper from Bolosky and others at Microsoft that demonstrates that Paxos can indeed be used as a high-volume replication protocol. They argue that a Paxos-based storage service can provide performance close to what the underlying hardware can support. You don't have to resort to simple primary-backup schemes which make it difficult to deal with arbitrary machine restarts. Also, you don't have to give up sequential consistency for performance and deal with the complications of eventual consistency. The crux of their argument is: for a system that is in a single datacenter, and needs to use commodity networking and disks, the Paxos implementation will certainly not be the bottleneck.

They implemented this in the context of a storage system called Gaios. The paper has plenty of implementation details and performance results. They even ran an OLTP benchmark on SQL Server configured to use Gaios storage. Neat stuff!

Spinnaker exploits the same ideas as Gaios, but the exposes a user-programmable key-value store API instead of building scale-out storage. The results from Gaios independently verify the arguments we tried to make in the Spinnaker paper -- you can use a consensus algorithm for data replication in a scale-out system without sacrificing performance.

Thursday, July 14, 2011

Column Stores and Hadoop

Switching gears a bit from the NoSQL to the Hadoop world ... here's a quick preview of some work we did on storage organization on Hadoop. We started this work to investigate how a columnar storage layer could be implemented for Hadoop and if it would lead to any insights that weren't already known in the context of parallel DBMSs. It turned up some pretty interesting results.

First, we built an InputFormat/OutputFormat pair on Hadoop v-0.21 that uses some of the new APIs for a pluggable BlockPlacementPolicy. We gave it a rather inventive name -- CIF and COF-- for ColumnInputFormat and ColumnOutputFormat :-) Instead of using a PAX-like layout with RCFile, CIF lets you you true columnar storage where each column is stored in a separate file. As one would expect, when you scan only a small number of columns from a much wider dataset, CIF eliminates the I/O for the unnecessary columns and improves your map-phase performance compared to SequenceFiles and RCFile.

Second, most of the literature on columnar processing in DBMSs usually deals with atomic types -- we looked at more complex types that are typically used in the Hadoop environment -- lists, maps, and nested records. These are all typically variable length columns. Being able to automatically avoid deserialization for these complex types when they're not accessed during the MapReduce job turns out to be a nice performance boost.

We combined the two ideas above to build a novel skip-list based Input/OutputFormat called CIF-SL. It lets you eliminate I/O for columns that you don't need in a job. Also, it lets you use lazily decompress and deserialize columns that are only accessed for some of the records during a MapReduce job. This can lead to substantial CPU savings. If it turns out that if you access a particular column for only a small fraction of the records you scan,CIF-SF can provide a pretty substantial advantage over plain CIF.

We wrote up these findings in a paper for PVLDB with details of the APIs, examples, an implementation sketch, and a performance evaluation. This paper is much simpler than Spinnaker paper, and a much more fun read :) Avrilia Floratou, who did most of the engineering for this project during her internship at Almaden, will present this at VLDB this fall.

Saturday, July 9, 2011

HBase, Hadoop, and Facebook Messages

If you were at the Hadoop summit and watched Karthik Ranganathan’s talk, you know how Facebook now uses HBase to store all the inbox messages. It was a fun talk and Karthik shared many of the hard-won lessons from deploying and maintaining a large application on HBase and Hadoop/MapReduce. Here are a few observations:

Karthik and the rest of his team at Facebook didn’t choose Cassandra even though they have the necessary skills and expertise at Facebook to do this. In fact, Cassandra was originally built by Facebook engineers to store and index their inbox messages. Why? In fact, one of the audience members also asked this question. As we argued in the Spinnaker paper, you don’t want to force your application developer to deal with eventual consistency unless your availability needs cannot be met any other way. Writing automatic conflict resolution handlers is messy work, and as your application gets more complicated than a shopping cart, it is increasingly more difficult to reason about it. HBase with its easier consistency model likely made life much easier for the team building the messages service. Karthik's response was essentially the same.

If Spinnaker was available, would the Facebook guys have used it instead? I’d like to think the answer is yes :-) . Like I explained in a previous blog post, Spinnaker provides a similar consistency model as HBase and better availability and performance. In fact, we talked to Karthik and his team in the early days when they were planning the migration to HBase. These guys are smart engineers – they liked the ideas in Spinnaker, and they asked the right questions: 1. Is this open source? 2. Is this battle-tested? 3. What happens when disks fails completely or blocks get corrupted? There’s a bunch of code in HDFS that HBase gets to leverage for robustness to such failures. Spinnaker is/was just a research prototype and would require substantial engineering effort to get it ready for a production app like FB’s messages. But there's a simpler answer for why HBase is a good fit here and Spinnaker would not be needed even if it were available ... read on.

The availability guarantees provided by HBase are probably sufficient for the Facebook messages app. Let's do some quick calculations: on a cluster with about 100 nodes, if you expect that a regionserver crash occurs every 30 days, and conservatively, it takes about 10 minutes to recover, the availability of a row in HBase is approximately: 0.999768. (I'm just using 1 - MTTR/MTTF and making a whole bunch of simplifying assumptions.) If all the data for a user is stuffed into a single row, which seemed to be the case in FB’s design, this is perfectly acceptable availability: for a user that visited his message box twice every day, this would mean 0.169 "Not Available" errors a year. Pretty reasonable.

On the other hand, ff you have a more complex application, and you needed access to *all the rows* to support your application, the availability of your application would be 0. 999768 ^ 100 = 0.977115. Although this looks close enough to the previous number, this is really bad. The same user that visited his inbox twice a day would see about 16 "Not Available" errors a year. That's more than an error every month. That could make for unhappy users. By designing the application appropriately, the Facebook guys were able to fit well within the availability guarantees that HBase can provide. Nice job!

Spinnaker still has other performance advantages ... but that's for another blog post :-)