Thursday, June 30, 2011

2011 Hadoop Summit

I was at the Hadoop Summit in Santa Clara today. It was fun. The long-rumored Hadoop spin-off was confirmed yesterday -- a bunch of the core infrastructure team under Eric14 is now a separate company called Hortonworks. Their current business plan is to sell training, consulting, and services around Apache Hadoop.

The most surprising part of the summit was the sheer number of new VC-funded startups that were there. Last year, the major ones were Cloudera, Datameer, and Pentaho. This year a host of new companies appeared: MapR, Zettaset, Arista, PervasiveDataRush, SyncSort, and DataStax (previously Riptano).

One observation I had was that there seems to be a flurry of activity in the scalable PubSub system space. At least five systems were discussed/mentioned:

  1. Kafka from LinkedIn
  2. Scribe from Facebook
  3. Flume from Cloudera
  4. Hedwig from Yahoo
  5. Data Highway from Yahoo

Of these, I have a basic understanding of Kafka and Hedwig, but I haven't used either. They seem to have made different design choices. Kafka is aimed squarely at log collection; its authors argue that some of their API choices are better than the ones offered by Scribe. Hedwig seems closer to a true scale-out queuing system with guaranteed in-order, at-least-once delivery. It'll be fun to do an in-depth comparison at some point.
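
To make "in-order, at-least-once delivery" a bit more concrete, here is a minimal Python sketch of the idea. It is a toy in-memory model, not Hedwig's (or Kafka's) actual API or design; the class, method, and function names are all made up for illustration. The assumption is simply that the topic keeps an append-only log and a per-subscriber position, and only advances that position when the subscriber acknowledges the message.

```python
class InOrderAtLeastOnceTopic:
    """Toy topic: each message is redelivered, in order, until it is acked."""

    def __init__(self):
        self._log = []        # append-only list of published messages
        self._next = {}       # subscriber name -> index of oldest unacked message

    def publish(self, message):
        self._log.append(message)

    def subscribe(self, subscriber):
        # New subscribers start at the beginning of the log in this sketch.
        self._next.setdefault(subscriber, 0)

    def poll(self, subscriber):
        """Return (seq, message) for the oldest unacked message, or None."""
        pos = self._next[subscriber]
        if pos < len(self._log):
            return pos, self._log[pos]
        return None

    def ack(self, subscriber, seq):
        # Advance only if the ack matches the message currently outstanding.
        if seq == self._next[subscriber]:
            self._next[subscriber] = seq + 1


def consume_all(topic, subscriber, handler):
    """Deliver until the log is drained; handler returns True on success."""
    deliveries = []
    while True:
        item = topic.poll(subscriber)
        if item is None:
            return deliveries
        seq, message = item
        deliveries.append(message)
        if handler(message):
            topic.ack(subscriber, seq)
```

If a handler fails on a message the first time it sees it, that message shows up twice in the delivery list, but nothing after it is delivered until it succeeds. That is the trade-off of at-least-once delivery: no message is lost and order is preserved, but consumers have to tolerate duplicates.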
