Sunday, July 24, 2011

RCFile and CIF

The benefits of using columnar storage for relational data is well known. The "Column-store for Hadoop" paper I recently blogged about focuses on the advantages of using CIF for complex variable-length data types like maps, arrays, nested records etc. Not surprisingly, it turns out that even if the data is flat and well-structured, CIF has two advantages over RCFile:
  1. When you read a dataset laid out in CIF, you end up touching fewer HDFS blocks than you would with RCFile. 
  2. It is possible to add derived columns in a way that they are co-located correctly with the existing data. With RCFile, as it stands, the only option is to re-load the entire dataset.
Let me explain the first point with some numbers. Assume you're reading a dataset with three columns: a 4-byte integer (X), a 100 byte string (Y), and an 8-byte float (Z).  Further, assume that you have 100 million rows. That gives us 100M * (112) = 11200Mi bytes ~ 11.2 GB. Assuming 64MB blocks, the data will be laid out over ~175 blocks when using RCFile. I'm completely ignoring the metadata overheads.

Suppose that you were only reading X, ideally, you'd want to read 4 * 100M ~ 400MB of data. With RCFile, you'd open each of the 175 blocks, read a small portion of the data there (~2.28MB) for X, then move on to the next block. With the CIF layout where you put each column in a separate file, you'd only read
~7 HDFS blocks to read all the X values. If you were only scanning Y, this would be 157 blocks, and for Z,  13 blocks. If the actual processing you were doing on the data were simple, such as just adding up the numbers -- which is typically the case for relational workloads, the per-block overheads (opening the block, interpreting the RCFile headers, seeking to the right place, etc.) become very substantial very quickly. As a result, CIF starts to have a major advantage over RCFile for wide rows where only a small number of columns are being scanned.

The second advantage is easier to understand. Given that HDFS is an append-only file system, adding derived columns while guaranteeing locality for the three replicas is very difficult with RCFile. Consider that you wanted to add a new column T to the dataset above that was computed by analyzing the string Y. An example would be some kind of feature extractor applied on a URL field. Now, this field T needs to be stored on the same nodes that the (X,Y,Z) columns for the row are stored on. The only way to guarantee this with the current design of RCFile is to re-write the file with (X,Y,Z,T) so that all the columns are organized and laid out correctly in the same HDFS block.

With CIF, the above is easily accomplished by adding the T data in a separate file and placing it next to the X,Y,Z files in each of the split directories of the dataset (Section 4.2 in the paper explains split-directories.) The RCFile design will need to be changed significantly using the ideas in CIF to accommodate such a feature.

The other side: RCFile advantages

The biggest advantage of RCFile is its simplicity. It fits beautifully within HDFS's constraints. CIF requires changing the HDFS placement policy (possible Hadoop-0.21.0 onwards) and also requires tracking a bunch of additional metadata. It also ends up creating a larger number of files, which could cause additional pressure on the namenode. However, for many deployments this should not be a major issue. There's a detailed discussion of this in the paper.


  1. Sandeep,
    I am trying to store plain text data onto HDFS in RCFile format. then would like to do benchmarking for reads across various columns. I am not able to find API documentation, JAR file that has implementation for RCFile. All i found was hive supports storage of data on HDFS in RC File format. i am looking at Input/OutputFormat APIs for RCFile. Please share.

  2. I was going through your paper where you have done benchmarking across TXT, SEQ and CIF. Could you please share that code (github project?). i am trying to do something similar with Parquet/RCFile and possibly CIF. We need to narrow down on a columnar storage format.

  3. Deepak,

    The code that I wrote for the benchmarks was at a previous job, and it is not available under an open-source license. As for RCFile APIs, I'd recommend pinging the Hive-dev or Hive-users mailing lists. For a start, you can look at how Hive uses its ObjectInspectors in conjunction with the InputFormat to get at the data in RCFiles. Good luck!