- When you read a dataset laid out in CIF, you end up touching fewer HDFS blocks than you would with RCFile.
- It is possible to add derived columns in a way that they are co-located correctly with the existing data. With RCFile, as it stands, the only option is to re-load the entire dataset.
Let me explain the first point with some numbers. Assume you're reading a dataset with three columns: a 4-byte integer (X), a 100 byte string (Y), and an 8-byte float (Z). Further, assume that you have 100 million rows. That gives us 100M * (112) = 11200Mi bytes ~ 11.2 GB. Assuming 64MB blocks, the data will be laid out over ~175 blocks when using RCFile. I'm completely ignoring the metadata overheads.
Suppose that you were only reading X, ideally, you'd want to read 4 * 100M ~ 400MB of data. With RCFile, you'd open each of the 175 blocks, read a small portion of the data there (~2.28MB) for X, then move on to the next block. With the CIF layout where you put each column in a separate file, you'd only read
~7 HDFS blocks to read all the X values. If you were only scanning Y, this would be 157 blocks, and for Z, 13 blocks. If the actual processing you were doing on the data were simple, such as just adding up the numbers -- which is typically the case for relational workloads, the per-block overheads (opening the block, interpreting the RCFile headers, seeking to the right place, etc.) become very substantial very quickly. As a result, CIF starts to have a major advantage over RCFile for wide rows where only a small number of columns are being scanned.
The second advantage is easier to understand. Given that HDFS is an append-only file system, adding derived columns while guaranteeing locality for the three replicas is very difficult with RCFile. Consider that you wanted to add a new column T to the dataset above that was computed by analyzing the string Y. An example would be some kind of feature extractor applied on a URL field. Now, this field T needs to be stored on the same nodes that the (X,Y,Z) columns for the row are stored on. The only way to guarantee this with the current design of RCFile is to re-write the file with (X,Y,Z,T) so that all the columns are organized and laid out correctly in the same HDFS block.
With CIF, the above is easily accomplished by adding the T data in a separate file and placing it next to the X,Y,Z files in each of the split directories of the dataset (Section 4.2 in the paper explains split-directories.) The RCFile design will need to be changed significantly using the ideas in CIF to accommodate such a feature.
The other side: RCFile advantages
The biggest advantage of RCFile is its simplicity. It fits beautifully within HDFS's constraints. CIF requires changing the HDFS placement policy (possible Hadoop-0.21.0 onwards) and also requires tracking a bunch of additional metadata. It also ends up creating a larger number of files, which could cause additional pressure on the namenode. However, for many deployments this should not be a major issue. There's a detailed discussion of this in the paper.