Saturday, July 30, 2011

Hadoop: A Disruptive Technology?

This is going to be a rather non-technical post ... you have been warned :-)

Clayton Christensen's thesis (The Innovator's Dilemma), from more than a decade ago, was that the rational management practices of listening to your customers and investing in what solves your customers' problems can lead you astray in the context of disruptive technologies. I've heard the phrase "Hadoop is a disruptive technology" in various contexts. I wondered ... does it really hold true according to Christensen's classic definition?

For a technology to be "disruptive", it should satisfy three requirements:
  • It is a simplifying technology -- cheaper/easier than established technologies
  • Commercialized first in emerging or insignificant markets
  • The market leader's biggest customers generally don't want to and initially can't use it
Simplifying Technology

Assuming that "big warehouse" vendors are the established leaders in the market, I'd guess that this is true. Big warehouses are messy and complicated to install, configure, tune, and get running. Hadoop is certainly simpler to install. Warehouses are usually deployed with non-commodity storage/networking. Hadoop runs on  cheap commodity boxes. Is it simpler to use? That could go either way. If your problem fits in SQL, relational warehouses do an amazing job. Writing MapReduce programs is a good bit more work than writing a SQL query. Oh, and Hadoop is definitely cheaper :-) I'd give this a +0.75

Emerging Markets / Value Networks

This is an easy +1. No one is clamoring to replace their big enterprise warehouse, or even their big data marts, with Hadoop. It isn't clear how big the market for Hadoop and related technologies is -- it isn't even entirely clear what the market is. The early users have been big web companies that need to analyze web pages, search logs, and click logs. None of this data typically goes into a big relational warehouse. Monash has been arguing that the predominant driver for Hadoop has been the "big bit bucket" use case. Anant Jhingran alluded to many emerging use cases in enterprises -- most of these fall under the kinds of analytics that are not, or were not, commonly done in traditional warehouses. The interest in Hadoop is not coming from the sort of enterprises, departments, or groups that have typically bought warehouses to solve their data analysis needs.

Unattractive to Market Leaders and their Customers

Traditional enterprises don't, and can't, use Hadoop as their central data warehouse. Not today, and perhaps not in the near term either. The big central EDW is typically far more complicated than a typical Hadoop cluster: there is a complex ETL system feeding the beast, there are a bunch of tools producing and sending out reports, and there may be other ETL off-ramps feeding smaller data marts. Big EDW customers use complex SQL, many reporting tools that work with the warehouse, advanced features like views and triggers, and often even transactional updates. Big enterprises looking for an EDW solution simply cannot use Hadoop today. As a result, this is not the most attractive technology for big warehouse customers. Again, a +1.

That's 2.75/3. Neat. All this talk of Hadoop being a "disruptive technology" is not hot air after all. It does fit Christensen's definition rather well!

Sunday, July 24, 2011

RCFile and CIF

The benefits of using columnar storage for relational data are well known. The "Column-store for Hadoop" paper I recently blogged about focuses on the advantages of using CIF for complex variable-length data types like maps, arrays, and nested records. Not surprisingly, it turns out that even if the data is flat and well structured, CIF has two advantages over RCFile:
  1. When you read a dataset laid out in CIF, you end up touching fewer HDFS blocks than you would with RCFile. 
  2. It is possible to add derived columns in a way that keeps them correctly co-located with the existing data. With RCFile, as it stands, the only option is to re-load the entire dataset.
Let me explain the first point with some numbers. Assume you're reading a dataset with three columns: a 4-byte integer (X), a 100-byte string (Y), and an 8-byte float (Z). Further, assume that you have 100 million rows. That gives us 100M * 112 bytes = 11,200 MB ~ 11.2 GB. Assuming 64 MB blocks, the data will be laid out over ~175 blocks when using RCFile. (I'm completely ignoring the metadata overheads.)

Suppose you were only reading X. Ideally, you'd want to read 4 * 100M ~ 400 MB of data. With RCFile, you'd open each of the 175 blocks, read a small portion of the data there (~2.28 MB) for X, then move on to the next block. With the CIF layout, where each column goes in a separate file, you'd only read ~7 HDFS blocks to get all the X values. If you were only scanning Y, this would be 157 blocks, and for Z, 13 blocks. If the actual processing you were doing on the data is simple, such as just adding up the numbers -- which is typically the case for relational workloads -- the per-block overheads (opening the block, interpreting the RCFile headers, seeking to the right place, etc.) become very substantial very quickly. As a result, CIF starts to have a major advantage over RCFile for wide rows where only a small number of columns are being scanned.
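
For anyone who wants to fiddle with the numbers, here's the same back-of-the-envelope arithmetic as a tiny program (a sketch only: it rounds the way the post does, treats 1 MB as 10^6 bytes, and ignores all RCFile/CIF metadata):

    public class BlockCounts {
        public static void main(String[] args) {
            double rows = 100e6;   // 100 million rows
            double blockMB = 64;   // HDFS block size in MB

            double totalMB = rows * (4 + 100 + 8) / 1e6;             // ~11,200 MB for (X, Y, Z) together
            double rcfileBlocks = Math.ceil(totalMB / blockMB);      // ~175 blocks hold the whole dataset

            // With CIF, each column lives in its own file, so a scan touches only that column's blocks.
            double xBlocks = Math.ceil(rows * 4 / 1e6 / blockMB);    // ~7 blocks for X
            double yBlocks = Math.ceil(rows * 100 / 1e6 / blockMB);  // ~157 blocks for Y
            double zBlocks = Math.ceil(rows * 8 / 1e6 / blockMB);    // ~13 blocks for Z

            // With RCFile, a scan of X still opens all ~175 blocks and reads only a sliver of each.
            double xPerRcfileBlockMB = (rows * 4 / 1e6) / rcfileBlocks;   // ~2.3 MB of X per block

            System.out.printf("RCFile blocks: %.0f; CIF blocks for X/Y/Z: %.0f/%.0f/%.0f; X per RCFile block: %.2f MB%n",
                    rcfileBlocks, xBlocks, yBlocks, zBlocks, xPerRcfileBlockMB);
        }
    }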

The second advantage is easier to understand. Given that HDFS is an append-only file system, adding derived columns while guaranteeing locality for the three replicas is very difficult with RCFile. Consider that you wanted to add a new column T to the dataset above, computed by analyzing the string Y -- say, some kind of feature extractor applied to a URL field. Now, this field T needs to be stored on the same nodes as the (X, Y, Z) columns for the row. The only way to guarantee this with the current design of RCFile is to re-write the file with (X, Y, Z, T) so that all the columns are organized and laid out correctly in the same HDFS block.

With CIF, the above is easily accomplished by adding the T data in a separate file and placing it next to the X, Y, Z files in each of the split directories of the dataset (Section 4.2 in the paper explains split directories). The RCFile design would need to change significantly, along the lines of the ideas in CIF, to accommodate such a feature.
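
To make that concrete, here's a minimal sketch of what adding T looks like from the filesystem's point of view. The directory and file names below are my own illustration, not the paper's exact convention (Section 4.2 defines the real split-directory layout):

    import org.apache.hadoop.fs.Path;

    public class DerivedColumn {
        // A split directory holds one file per column for a horizontal slice of the dataset,
        // e.g. .../mytable/split-0017/X, .../split-0017/Y, .../split-0017/Z.
        // The derived column T is just one more file in the same directory, which the modified
        // block placement policy can then co-locate with the existing column files.
        static Path derivedColumnFile(Path splitDir, String column) {
            return new Path(splitDir, column);
        }

        public static void main(String[] args) {
            Path splitDir = new Path("/data/mytable/split-0017");    // hypothetical split directory
            System.out.println(derivedColumnFile(splitDir, "T"));    // /data/mytable/split-0017/T
        }
    }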

The other side: RCFile advantages


The biggest advantage of RCFile is its simplicity. It fits beautifully within HDFS's constraints. CIF requires changing the HDFS block placement policy (possible from Hadoop 0.21.0 onwards) and also requires tracking a bunch of additional metadata. It also ends up creating a larger number of files, which could put additional pressure on the namenode. For many deployments, however, this should not be a major issue. There's a detailed discussion of this in the paper.

Friday, July 15, 2011

Spinnaker, Paxos, and Gaios

Here's a great paper from Bolosky and others at Microsoft that demonstrates that Paxos can indeed be used as a high-volume replication protocol. They argue that a Paxos-based storage service can provide performance close to what the underlying hardware can support. You don't have to resort to simple primary-backup schemes, which make it difficult to deal with arbitrary machine restarts. You also don't have to give up sequential consistency for performance and deal with the complications of eventual consistency. The crux of their argument: for a system that sits in a single datacenter and uses commodity networking and disks, the Paxos implementation will certainly not be the bottleneck.

They implemented this in the context of a storage system called Gaios. The paper has plenty of implementation details and performance results. They even ran an OLTP benchmark on SQL Server configured to use Gaios storage. Neat stuff!

Spinnaker exploits the same ideas as Gaios, but it exposes a user-programmable key-value store API instead of scale-out storage. The results from Gaios independently verify the argument we tried to make in the Spinnaker paper: you can use a consensus algorithm for data replication in a scale-out system without sacrificing performance.

Thursday, July 14, 2011

Column Stores and Hadoop

Switching gears a bit from the NoSQL world to the Hadoop world ... here's a quick preview of some work we did on storage organization on Hadoop. We started this work to investigate how a columnar storage layer could be implemented for Hadoop, and whether it would lead to any insights that weren't already known in the context of parallel DBMSs. It turned up some pretty interesting results.

First, we built an InputFormat/OutputFormat pair on Hadoop 0.21 that uses some of the new APIs for a pluggable BlockPlacementPolicy. We gave them rather inventive names -- CIF and COF -- for ColumnInputFormat and ColumnOutputFormat :-) Instead of the PAX-like layout used by RCFile, CIF gives you true columnar storage where each column is stored in a separate file. As one would expect, when you scan only a small number of columns from a much wider dataset, CIF eliminates the I/O for the unnecessary columns and improves your map-phase performance compared to SequenceFiles and RCFile.
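
For flavor, here's roughly what wiring CIF into a job could look like. This is a sketch against the research prototype, not a shipping Hadoop API: ColumnInputFormat is the class name from the paper, but the setColumns projection helper and the mapper below are my assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class CifScanJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "cif-scan");
            job.setJarByClass(CifScanJob.class);

            // ColumnInputFormat is the prototype's format; setColumns is a hypothetical helper
            // that restricts the scan to just the columns the map function actually touches.
            job.setInputFormatClass(ColumnInputFormat.class);
            ColumnInputFormat.setColumns(job, "url", "clicks");

            job.setMapperClass(UrlClickMapper.class);   // hypothetical mapper over the two columns
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }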

Second, most of the literature on columnar processing in DBMSs deals with atomic types; we looked at the more complex types that are typically used in the Hadoop environment -- lists, maps, and nested records. These are all typically variable-length columns. Being able to automatically avoid deserialization for these complex types when they're not accessed during the MapReduce job turns out to be a nice performance boost.

We combined the two ideas above to build a novel skip-list-based Input/OutputFormat called CIF-SL. It lets you eliminate I/O for columns that you don't need in a job. It also lets you lazily decompress and deserialize columns that are only accessed for some of the records during a MapReduce job, which can lead to substantial CPU savings. If you access a particular column for only a small fraction of the records you scan, CIF-SL can provide a pretty substantial advantage over plain CIF.
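
To see why the lazy part matters, consider an access pattern like the sketch below. The LazyRecord interface is a made-up stand-in for whatever record abstraction the format exposes; the point is just that the expensive complex column never gets deserialized for records that fail the cheap predicate.

    import java.util.Map;

    public class LazyColumnAccess {
        // Hypothetical record abstraction: each column accessor deserializes on first use.
        interface LazyRecord {
            int getInt(String column);                  // cheap, fixed-width column
            Map<String, String> getMap(String column);  // expensive, variable-length column
        }

        static long countFlaggedHighScores(Iterable<LazyRecord> records) {
            long count = 0;
            for (LazyRecord r : records) {
                // The "score" column is read for every record ...
                if (r.getInt("score") > 90) {
                    // ... but the map column is decompressed and deserialized only for the small
                    // fraction of records that pass the predicate -- which is where a skip-list
                    // style format like CIF-SL saves CPU over plain CIF.
                    if (r.getMap("metadata").containsKey("flagged")) {
                        count++;
                    }
                }
            }
            return count;
        }
    }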

We wrote up these findings in a paper for PVLDB with details of the APIs, examples, an implementation sketch, and a performance evaluation. This paper is much simpler than the Spinnaker paper, and a much more fun read :) Avrilia Floratou, who did most of the engineering for this project during her internship at Almaden, will present it at VLDB this fall.

Saturday, July 9, 2011

HBase, Hadoop, and Facebook Messages


If you were at the Hadoop Summit and watched Karthik Ranganathan's talk, you know that Facebook now uses HBase to store all its inbox messages. It was a fun talk, and Karthik shared many hard-won lessons from deploying and maintaining a large application on HBase and Hadoop/MapReduce. Here are a few observations:

Karthik and the rest of his team at Facebook didn't choose Cassandra, even though they have the necessary skills and expertise to do so -- Cassandra was originally built by Facebook engineers to store and index their inbox messages. Why not? One of the audience members asked exactly this question. As we argued in the Spinnaker paper, you don't want to force your application developers to deal with eventual consistency unless your availability needs cannot be met any other way. Writing automatic conflict resolution handlers is messy work, and as your application gets more complicated than a shopping cart, it becomes increasingly difficult to reason about. HBase, with its easier consistency model, likely made life much simpler for the team building the messages service. Karthik's response was essentially the same.

If Spinnaker were available, would the Facebook guys have used it instead? I'd like to think the answer is yes :-) As I explained in a previous blog post, Spinnaker provides a similar consistency model to HBase, with better availability and performance. In fact, we talked to Karthik and his team in the early days when they were planning the migration to HBase. These guys are smart engineers -- they liked the ideas in Spinnaker, and they asked the right questions: 1. Is this open source? 2. Is this battle-tested? 3. What happens when disks fail completely or blocks get corrupted? There's a bunch of code in HDFS that HBase gets to leverage for robustness to such failures. Spinnaker is/was just a research prototype and would require substantial engineering effort to get it ready for a production app like FB's messages. But there's a simpler answer for why HBase is a good fit here and Spinnaker would not be needed even if it were available ... read on.

The availability guarantees provided by HBase are probably sufficient for the Facebook messages app. Let's do some quick calculations: on a cluster with about 100 nodes, if you expect a regionserver crash every 30 days and, conservatively, about 10 minutes to recover, the availability of a row in HBase is approximately 0.999768. (I'm just using 1 - MTTR/MTTF and making a whole bunch of simplifying assumptions.) If all the data for a user is stuffed into a single row, which seemed to be the case in FB's design, this is perfectly acceptable availability: for a user that visited his message box twice every day, this would mean 0.169 "Not Available" errors a year. Pretty reasonable.

On the other hand, if you have a more complex application that needs access to *all the rows* to do its work, the availability of your application would be 0.999768^100 = 0.977115. Although this looks close to the previous number, it is really bad. The same user that visited his inbox twice a day would see about 16 "Not Available" errors a year. That's more than an error every month, and could make for unhappy users. By designing the application appropriately, the Facebook guys were able to fit well within the availability guarantees that HBase can provide. Nice job!
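
Here's the arithmetic spelled out, with the same simplifying assumptions as above (availability ~ 1 - MTTR/MTTF, independent failures, and each visit needing either one row or all 100 nodes' worth of rows):

    public class AvailabilityMath {
        public static void main(String[] args) {
            double mttfHours = 30 * 24;        // one regionserver crash every 30 days
            double mttrHours = 10.0 / 60.0;    // ~10 minutes to recover

            double rowAvailability = 1 - mttrHours / mttfHours;        // ~0.999768
            double appAvailability = Math.pow(rowAvailability, 100);   // needs all 100 nodes: ~0.977115

            double visitsPerYear = 2 * 365;    // a user checking the inbox twice a day
            System.out.printf("single row: %.6f (~%.2f errors/year)%n",
                    rowAvailability, visitsPerYear * (1 - rowAvailability));   // ~0.17
            System.out.printf("all rows:   %.6f (~%.1f errors/year)%n",
                    appAvailability, visitsPerYear * (1 - appAvailability));   // ~16.7
        }
    }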

Spinnaker still has other performance advantages ... but that's for another blog post :-)


Thursday, June 30, 2011

2011 Hadoop Summit

I was at the Hadoop Summit in Santa Clara today. It was fun. The long-rumored Hadoop spin-off was confirmed yesterday -- a bunch of the core infrastructure team under Eric14 is now a separate company called HortonWorks. Their current business plan is to sell training, consulting, and services around Apache Hadoop.

The most surprising part of the summit was the sheer number of new VC-funded startups that were there. Last year, the major ones were Cloudera, Datameer, and Pentaho. This year a host of new companies appeared: MapR, Zettaset, Arista, PervasiveDataRush, SyncSort, and DataStax (previously Riptano).

One observation I had was that there seems to be a flurry of activity in the scalable PubSub system space. At least five systems were discussed/mentioned:

  1. Kafka from LinkedIn
  2. Scribe from Facebook
  3. Flume from Cloudera
  4. Hedwig from Yahoo
  5. Data Highway (Yahoo)

Of these, I have a basic understanding of Kafka and Hedwig, but I haven't used either. They seem to have made different design choices. Kafka is aimed squarely at log collection ... they argue that some of their API choices were better than the ones offered by Scribe. Hedwig seems closer to a true scale-out queuing system with guaranteed in-order, at-least-once delivery. It'll be fun to do an in-depth comparison at some point.

Friday, June 24, 2011

An Informal Availability Comparison Between HBase and Spinnaker

An important design goal in scale-out structured storage systems (“NoSQL” systems) is availability. Availability is usually defined as 1 – (MTTR/(MTTF + MTTR)). MTTR, the Mean-Time-To-Repair, is the duration for which a data item is not available for reads or writes when a node fails. MTTF is the Mean-Time-To-Failure.

We did a simple experiment measuring MTTR for both Spinnaker and HBase and learned some interesting stuff, but we didn't have room for it in the paper.

To compare HBase and Spinnaker, we used the following setup: a single client picks a random node in the cluster and starts writing data. Consecutive keys are used, and a single random 1 KB value is written for each key. After a predetermined amount of data has been written, we trigger a failure of the node that was processing the writes for this key range: in Spinnaker, we kill the leader of the cohort; in HBase, we kill the regionserver process for the key range.

We measured the time between the first failed write and the first successful write after that as the unavailability window experienced by the client. In both cases, we split this unavailability window into two parts: the time taken to detect the failure and the time taken to actually recover. Both systems use Zookeeper for failure detection; the average time to detect a failure can be configured by adjusting the Zookeeper session timeouts and was set to 2 seconds. In the results below, we subtract the failure detection times and report only the actual recovery times.


The picture above shows that for HBase, the recovery time grows linearly with the amount of data that was written to the system before the failure occurred. For Spinnaker, the recovery time remains constant. Neat! This is not unexpected: when a node fails, the regions served by the failed regionserver become unavailable for both reads and writes. HBase then needs to carry out multiple steps. First, the master node reads the log file (available in HDFS) of the failed regionserver and splits it into separate files for each region. Each of these regions is assigned to a healthy regionserver. These regionservers then replay the log, construct the appropriate memtable, and make the region available for reads and writes. The total recovery time includes the time taken to first split the log and then replay it before opening up for updates, which is clearly proportional to the size of the log.

This number can be decreased by flushing the in-memory data structures to on-disk SSTables more aggressively (decreasing the memtable flush limit) and checkpointing. While this will reduce the unavailability window, it severely affects write performance. Typical configurations suggest using around half the available memory for the memtables. With 16 gigabytes of memory on each server, the unavailability window on failure is over 13 minutes. As the available memory on a server increases, this window gets longer.
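
As a rough model (mine, not anything from the HBase documentation): the window is roughly the time to split the failed server's log plus the time to replay it, both proportional to the log size. The rates below are illustrative assumptions picked to land in the post's 13-minute ballpark.

    public class HbaseRecoveryModel {
        // Crude model: recovery ~ logBytes/splitRate + logBytes/replayRate.
        static double recoverySeconds(double logGB, double splitMBps, double replayMBps) {
            double logMB = logGB * 1024;
            return logMB / splitMBps + logMB / replayMBps;
        }

        public static void main(String[] args) {
            // ~8 GB of memtable-backed log (half of a 16 GB server); 20 MB/s assumed for each phase.
            double seconds = recoverySeconds(8, 20, 20);
            System.out.printf("~%.1f minutes of write unavailability%n", seconds / 60);   // ~13.7 minutes
        }
    }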

In the case of Spinnaker, the ranges on a failed master become unavailable for writes until a new master is elected. The replication protocol in Spinnaker ensures that the new master only needs to re-propose the values since the last-known committed value. This is often less than 1 second's worth of data, and is independent of the size of the log. As the picture shows, the unavailability window of Spinnaker is relatively constant irrespective of the amount of data written.

Spinnaker continues to be available for weak reads, which are served by the other members of the cohort. Strongly consistent reads are serviced by the master, and are therefore unavailable until a new master is elected, just like writes. In contrast, HBase is unavailable for both reads and writes, although I suspect it wouldn't be too difficult to modify HBase to return stale reads while recovery is running.

This interesting property is a consequence of the fact that the HBase design does not provide for a "warm standby" and delegates replication and consistency entirely to the filesystem (HDFS). While I'm not aware of any published work, I suspect there are interesting ways in which HDFS and the HBase design could be modified to reduce the unavailability window when a node fails.

Eventually consistent systems like Cassandra continue to be available for both reads and writes after failures by sacrificing consistency. By definition, their MTTR is 0 as long as any node in the system is alive, and therefore the availability is 100%. We did not include Cassandra in this experiment.