Tuesday, June 25, 2013

Big Fast Data Blog from Wisconsin

My advisor Jignesh Patel has a new blog on data management. This is going to be a great place to get the latest on data management from Wisconsin. His first post is on BitWeaving -- a SIGMOD 2013 paper that takes the column store idea to its logical extreme. Check it out here.

Thursday, June 13, 2013

Cassandra Summit 2013

I spent some time at the recent Cassandra Summit 2013 in San Francisco. It was great to see that such a big, engaged community of Cassandra users has formed in such a short while. Some observations:

As anyone who's watching this space knows, Netflix was and continues to be one of the biggest users of Cassandra in production. Adrian Cockcroft from Netflix had an interesting presentation --

  • Netflix has a cross-datacenter Cassandra cluster up and running. The "datacenter" here is really an AWS region (with availability zones playing the role of racks). But they've spanned North America and Europe and run several benchmarks on it successfully.
  • Netflix has a bunch of OSS code out on GitHub to make it easy to spin up, expand, shrink, and shut down their Cassandra clusters on AWS. They continue to release other code under the "Netflix OSS" brand.
  • Astyanax, a Java client for Cassandra, seems to be Netflix's preferred way of interacting with Cassandra.
  • Rather intriguingly, Adrian mentioned that it might be nice to have an Astyanax port to DynamoDB, which might be quicker for small clusters, and use a full-fledged Cassandra cluster only for larger sizes at which DynamoDB gets too expensive. I wonder, as Amazon continues to drop its prices for DynamoDB, if the point at which you'll need to spin up your own Cassandra cluster will continue to move towards larger and larger clusters.
  • There was some discussion of the cost of running Cassandra on bare metal versus on AWS, and Adrian seemed to indicate that for them, the additional cost of running on AWS was outweighed by the benefits of speed of execution. I wonder whether, once you have a relatively stable system and its running cost becomes an important consideration, having it on Cassandra instead of DynamoDB gives you more control should you decide to move it to your own datacenter with the right hardware for Cassandra.
  • Adrian also mentioned that for their analytics, they still have a Teradata system, but it is slowly being replaced by a combination of Amazon Redshift and Hadoop. He did say that when your data is spread across multiple Cassandra clusters, joining across them to run some analytics on Hadoop can be a lot of work.
  • I found the transition off of Teradata really interesting. Netflix is one of the largest companies with a very large portion of their IT infrastructure in the cloud. The pace at which they move workloads off Teradata and onto Redshift and Hadoop will likely be much higher than what typical enterprises can manage, but I'll be watching this as an indicator of what might come.

There were several other talks from startups and mid-sized enterprises describing how they were using Cassandra. There still seems to be some confusion about consistency and how much to think about it. But for the most part, it seemed to me that the community has decided to use quorum reads and writes, and not worry too much about it. As one speaker put it, "if once in a year, we lose how far into the video you are when you resume playing later, the world doesn't end". The biggest class of data that continues to be managed by Cassandra seems to be user-specific data (profiles, bookmarks, activity history). For this, I think Cassandra makes perfect sense.
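
For anyone fuzzy on why "just use quorum" is a reasonable default, here is a minimal sketch of the arithmetic in Python. It is not Cassandra client code and not something shown at the summit; the function names and the replication factor of 3 are my own assumptions for illustration. The idea is that a quorum is a majority of replicas, and when the number of replicas read plus the number written exceeds the replication factor, every read has to touch at least one replica that saw the latest write.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor, chosen only for illustration
    q = quorum(rf)
    print(q, overlaps(q, q, rf))  # quorum reads combined with quorum writes
    print(overlaps(1, 1, rf))     # single-replica reads and writes
```

With three replicas, a quorum is two, and two plus two is more than three, so quorum reads always overlap quorum writes. Reading and writing a single replica each gives no such guarantee, which is the trade-off the "world doesn't end" quote is shrugging at.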