Tuesday, August 16, 2011

More on Disruptions

Another non-technical post. I was thinking through the amazing variety of industries in which web and related technologies have caused disruptive changes. There's a whole bunch of disruptions along the theme of monetizing the "long tail" --

Long Tail Disruptions
Existing Industry                | Disruptive Technology                          | Companies
Media Advertising                | Online Ad networks, exchanges                  | AdSense
Retail                           | Online Retail/Auctions                         | Amazon, eBay
Banking                          | P2P Lending through web-based apps             | Lending Club
Hotels, B&B                      | P2P Space Sharing through web-based apps       | AirBnB
Content Publishing               | Self Publishing with hosted blogging platforms | Blogger, WordPress, etc.
Discounted High-End Brand Retail | Web, Social Media + Daily Deal Email           | Gilt.com, Fab.com, ...
Tax Accounting                   | Web-based Apps                                 | TurboTax
Financial Advice                 | Web-based Apps                                 | Mint.com
Broadcast Television             | Streaming Video                                | Netflix, Hulu

And now the IT industry itself is possibly undergoing several disruptive changes:

IT Disruptions
Existing Industry                  | Disruptive Technology | Companies
Middleware (DBs, AppServers, etc.) | PaaS                  | Amazon Web Services, Google AppEngine, Force.com, Heroku, SpringSource, CloudFoundry, Rackspace Cloud
Data Warehousing                   | MapReduce, Hadoop     | Yahoo, Hortonworks, Cloudera, IBM, ...
Short Request Processing           | NoSQL stores          | DataStax, Couchbase, Grid vendors

Some of these disruptions are likely to have a small effect, and some of these are probably going to change the landscape of IT. It sure is an interesting time to be in technology!

Monday, August 8, 2011

HBase and Column Stores

I've been asked this a few times -- how would you compare the column store on Hadoop that we built (CIF) with HBase? HBase offers column families that are stored separately -- so you should be able to store each column in a separate file by putting it in its own column family. This should work, and you should be able to get some amount of I/O elimination when scanning a small subset of columns from a much larger table. However, HBase introduces several inefficiencies compared to the approach in our paper.
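The column-per-file idea behind both approaches can be illustrated with a toy sketch -- each column of a table goes to its own file, and a scan opens only the files for the columns it needs. This is purely illustrative (plain text files, made-up names), not CIF's or HBase's actual format:

```python
import os
import tempfile

def write_columnar(directory, rows, columns):
    """Write each column of `rows` (a list of dicts) to its own file."""
    for col in columns:
        with open(os.path.join(directory, col + ".col"), "w") as f:
            for row in rows:
                f.write(str(row[col]) + "\n")

def scan_columns(directory, columns):
    """Read back only the requested columns; other files are never opened."""
    data = {}
    for col in columns:
        with open(os.path.join(directory, col + ".col")) as f:
            data[col] = [line.rstrip("\n") for line in f]
    return data

d = tempfile.mkdtemp()
rows = [{"id": 1, "name": "a", "price": 10},
        {"id": 2, "name": "b", "price": 20}]
write_columnar(d, rows, ["id", "name", "price"])

# Scanning one column out of three touches one file, not three:
print(scan_columns(d, ["price"]))  # {'price': ['10', '20']}
```

The I/O elimination comes entirely from the layout: a scan over one of N equally sized columns reads roughly 1/N of the bytes.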

  1. Path length: Reading from HBase instead of directly reading a file (or a set of files) from HDFS means each byte now flows through more code. Data gets read off the disk by the DataNode process in HDFS, then it moves to the DFSClient in the HBase region server, and finally to the (key, value) pair returned by the InputFormat that reads using the HBase client object. Each of these additional hops the data travels adds overhead.
  2. Read logic: The read logic requires checking in at least two places before a row can be returned -- the SSTable on disk, and the state in memory in the Memtable (to use Bigtable terms). This is also additional overhead compared to simply reading the bytes off HDFS.
  3. Schema free: HBase wasn't designed to be used with fixed schemas for a table -- as a result, every row that is stored also stores the names of every column in that row. Knowing the schema allows CIF to store that metadata separately, once. In HBase the repetition of the column names can be ameliorated by compressing the SSTables, but that still has a cost in terms of CPU cycles spent decompressing the data while reading. Again, more unnecessary overhead.
  4. Load speeds: If you aren't using a special bulk-load API and instead use the standard insert API to insert a new batch one row at a time, you'll send all the data over the network twice -- once to the logs, and a second time when the SSTables get replicated. If you're network-bound during the load -- and you should be if you are replicating and you're not doing any crazy processing -- you'll load at half the pace you could get with CIF.
  5. MapReduce Outputs: This one's a little tricky -- it is not clear how to write an OutputFormat that inserts the results of a MapReduce job into HBase in a way that is compatible with MapReduce's fault-tolerance assumptions. If the OutputFormat inserts key-value pairs directly into HBase and the task fails halfway through, there's no way to "roll back" the stuff already inserted into HBase -- the last I checked, you couldn't apply an arbitrary sequence of inserts atomically in HBase. There may be application-specific ways out of this -- but with CIF, you simply end up writing a new set of files; if the task fails, you delete them as part of the cleanup. Easy.
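The two-place read path in point 2 can be sketched with plain dicts standing in for the Memtable and SSTable. This is a toy model of the LSM read logic, not HBase's actual code:

```python
# Toy model of the LSM read path (Bigtable terms): a read must consult the
# in-memory Memtable before falling back to the on-disk SSTable.

class ToyLSM:
    def __init__(self):
        self.memtable = {}   # recent writes, held in memory
        self.sstable = {}    # flushed data, standing in for "on disk"

    def put(self, key, value):
        self.memtable[key] = value

    def flush(self):
        # Flush the Memtable contents into the SSTable.
        self.sstable.update(self.memtable)
        self.memtable.clear()

    def get(self, key):
        # Step 1: check the Memtable; step 2: fall back to the SSTable.
        if key in self.memtable:
            return self.memtable[key]
        return self.sstable.get(key)

db = ToyLSM()
db.put("k1", "v1")
db.flush()
db.put("k1", "v2")      # newer value shadows the flushed one
print(db.get("k1"))     # v2 -- the Memtable wins
```

Even in this stripped-down form, every read pays for two lookups; a plain HDFS file scan pays for none of this machinery.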
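The cost of repeating column names per row (point 3) is easy to estimate with back-of-the-envelope arithmetic. The column names and per-row data size below are made-up assumptions, not measurements:

```python
# Rough model: schema-free storage repeats every column name in every row;
# schema-aware storage (CIF) records the names once, as separate metadata.

columns = ["customer_id", "order_date", "total"]   # illustrative schema
values_per_row = 24                                # assumed bytes of data per row
name_bytes = sum(len(c) for c in columns)          # 26 bytes of column names
n_rows = 1_000_000

schema_free = n_rows * (values_per_row + name_bytes)   # names in every row
with_schema = n_rows * values_per_row + name_bytes     # names stored once

print(f"overhead: {schema_free / with_schema:.2f}x")   # overhead: 2.08x
```

With short values and verbose column names the raw bytes can more than double; compression claws much of that back on disk, but, as noted above, at the price of CPU during reads.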
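The factor of two in point 4 falls out of a rough network model. The replication factor and the simplification that both the logs and the flushed SSTables are replicated across the network are assumptions of the sketch:

```python
# Network bytes moved per load, under simplifying assumptions.

data_bytes = 1_000_000   # size of the batch being loaded (illustrative)
replicas = 3             # assumed HDFS replication factor

# Row-at-a-time insert path: the data crosses the network once for the
# (replicated) logs, and again when the flushed SSTables are replicated.
hbase_network = data_bytes * replicas + data_bytes * replicas

# CIF-style load: the files are written and replicated exactly once.
cif_network = data_bytes * replicas

print(hbase_network / cif_network)   # 2.0
```

If the network is the bottleneck, moving twice the bytes means half the load throughput, which is the claim above.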
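The CIF-side recovery story in point 5 -- each task attempt writes a fresh set of files, and a failed attempt's files are deleted during cleanup -- can be sketched as follows. Function and path names here are illustrative, not CIF's actual API:

```python
import os
import shutil
import tempfile

def run_task(output_dir, records, fail=False):
    """Write records to a fresh attempt directory; promote it only on success."""
    tmp = tempfile.mkdtemp(prefix="attempt_")
    try:
        with open(os.path.join(tmp, "part-00000"), "w") as f:
            for r in records:
                f.write(r + "\n")
            if fail:
                raise RuntimeError("simulated task failure")
        # Success: promote the attempt directory to the real output path.
        shutil.move(tmp, output_dir)
        return True
    except RuntimeError:
        shutil.rmtree(tmp)   # cleanup: no partial output survives the failure
        return False

out = os.path.join(tempfile.mkdtemp(), "job_output")
assert run_task(out, ["a", "b"], fail=True) is False   # nothing left behind
assert run_task(out, ["a", "b"]) is True               # the retry succeeds
```

Because a failed attempt leaves no files behind, a retried task can't produce duplicate output -- which is exactly the rollback property that direct inserts into HBase lack.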
While HBase could be used as a columnar storage layer for scan-heavy applications, it would probably be a lot slower. However, if you need to support point lookups, updates, and deletes, this cost is justifiable. For applications that are append-only and don't need to deal with point updates, CIF provides an alternative with much higher performance.

An easy way to verify this would be to run a simple benchmark and measure the scan performance of CIF and HBase :-) I'll try to do that when I get a chance and post the results.
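The harness for such a benchmark might look something like the timing loop below. The workload here is a stand-in, not a real CIF or HBase scan, and taking the best of several runs is just one reasonable way to reduce noise:

```python
import time

def time_scan(scan_fn, repeats=3):
    """Return the best wall-clock time of `repeats` runs of scan_fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        scan_fn()
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder workload standing in for a column scan:
rows = list(range(100_000))
elapsed = time_scan(lambda: sum(rows))
print(f"best of 3: {elapsed:.4f}s")
```

The interesting comparison would plug in one scan function that reads CIF files directly off HDFS and another that scans the same data through the HBase client, over the same column subset.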