Wednesday, August 22, 2012

Amazon Glacier

Amazon recently introduced Glacier, a cloud data archival service. Based on James Hamilton's post, it is presumably built on a large, shared tape infrastructure. With the addition of this service, Amazon now provides a very compelling set of storage services to suit a wide variety of application needs.

| Service | Purpose | Pricing (as of August 2012) |
| --- | --- | --- |
| ElastiCache | Memcached service | 9 cents/hour for a 1.7 GB node (approx. 3,800 cents/GB-month) |
| DynamoDB | Low-latency, guaranteed-throughput, indexed, SSD-based NoSQL store | 100 cents/GB-month |
| SimpleDB | Older, more complex NoSQL store | 25 cents/GB-month |
| EC2 Local Storage | Instance-local storage | approx. 36 cents/GB-month |
| EBS | Reliable block store | 10 cents/GB-month |
| S3 | Cross-datacenter durable storage | 12.5 cents/GB-month (9.3 cents/GB-month with reduced redundancy) |
| Glacier | Archival storage | 1 cent/GB-month |

Each service has a different pricing structure based on the I/O requests you subject it to, and different guarantees about how long those requests might take. But if you look at it purely from a storage cost standpoint, you have a clear idea of how much it costs a developer to keep a piece of data in memory, on SSDs, on local spinning disks, on in-datacenter-redundant disks, on cross-datacenter-redundant disks, or on tape!
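To make the comparison concrete, here's a small Scala sketch of the cents/GB-month arithmetic behind the table. The only conversion needed is for ElastiCache, which is priced per node-hour; the ~730 hours/month figure is an approximation.

```scala
// A small sketch of the cents/GB-month arithmetic behind the table above,
// using the August 2012 prices.
object StorageCostComparison {
  val hoursPerMonth = 730.0

  // ElastiCache is priced per node-hour: 9 cents/hour for a 1.7 GB node.
  val elastiCache = 9.0 * hoursPerMonth / 1.7 // ~3,865 cents/GB-month

  val centsPerGBMonth = Map(
    "ElastiCache"       -> elastiCache,
    "DynamoDB"          -> 100.0,
    "EC2 local storage" -> 36.0,
    "SimpleDB"          -> 25.0,
    "S3"                -> 12.5,
    "EBS"               -> 10.0,
    "Glacier"           -> 1.0
  )

  def main(args: Array[String]): Unit =
    centsPerGBMonth.toSeq.sortBy(-_._2).foreach { case (service, cents) =>
      println(f"$service%-18s ${cents}%8.1f cents/GB-month")
    }
}
```

Sorted this way, the tiers span more than three orders of magnitude, from in-memory storage down to tape.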

Wednesday, August 8, 2012

Scala, DSLs, and Big Data


I've been playing with Scala on a bunch of different projects over the last several months. One of the most interesting things Scala offers for large-scale data management is the possibility of easy-to-build data processing DSLs (Domain Specific Languages).

Why do we suddenly need DSLs? SQL has served us well for the last several decades, and perhaps will continue to do so for most conventional business query processing needs. However, several new classes of data processing problems seem to require programmatic data access, not just query processing. SQL is a great language when your data processing application needs to perform joins, selections, projections, and aggregation; a vast majority of data processing falls under this category. When the task at hand is substantially different, such as assembling and analyzing DNA sequences, understanding sentiment about various topics from social media feeds, recommending content based on click streams and ratings, building up ETL workflows (a task that SQL systems have traditionally been bad at), or statistical/predictive analysis, SQL is neither a convenient nor a particularly effective interface for expressing the logic involved. Sure, you can solve the problem using SQL + UDFs + scripting, but there are more convenient ways to do it. Besides, SQL optimizers work hard to figure out a good join order and join plan, de-correlate nested subqueries when required, and push down predicates. If your application doesn't really need these sorts of optimizations, the inconvenience of expressing your logic in SQL doesn't buy you anything!
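To make "programmatic data access" concrete, here is a small illustrative sketch (the Click type and the 30-minute session gap are invented for the example). The first method maps directly onto SQL's GROUP BY; the second, sessionizing an ordered click stream, is stateful iteration that pure SQL expresses awkwardly:

```scala
// Illustrative only: a record type for a click stream.
case class Click(user: String, url: String, ts: Long)

object ClickStream {
  // SQL-friendly: SELECT user, COUNT(*) FROM clicks GROUP BY user
  def clicksPerUser(clicks: Seq[Click]): Map[String, Int] =
    clicks.groupBy(_.user).map { case (u, cs) => u -> cs.size }

  // Programmatic: split each user's ordered clicks into sessions whenever
  // more than gapMs elapses between consecutive clicks (30 minutes here).
  def sessions(clicks: Seq[Click],
               gapMs: Long = 30 * 60 * 1000): Map[String, List[List[Click]]] =
    clicks.groupBy(_.user).map { case (u, cs) =>
      val ordered = cs.sortBy(_.ts).toList
      // Fold over the ordered clicks, prepending to the current session or
      // starting a new one; reverse at the end to restore time order.
      val grouped = ordered.tail.foldLeft(List(List(ordered.head))) { (acc, c) =>
        if (c.ts - acc.head.head.ts <= gapMs) (c :: acc.head) :: acc.tail
        else List(c) :: acc
      }
      u -> grouped.map(_.reverse).reverse
    }
}
```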

The promise of embedded DSLs in Scala is that you can devise a sub-language that is convenient for solving problems in the domain of interest. At the same time, you might also be able to specify certain domain-specific optimizations that help execute the DSL program more efficiently. Because it is an embedded DSL, snippets of code in this DSL are also valid Scala code, so you still get the benefits of the Scala compiler and the optimizations that the JIT compiler can provide (Scala runs on the JVM). On the easy-to-build front, this also means you don't have to build separate tooling (editors, compilers, debuggers) for the DSL; you can simply continue to use the tooling around Scala. DSLs provide an interesting new tradeoff between building a library in a host language and building an entirely new language.
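As a minimal, hypothetical sketch of what "embedded" means here (Table, where, and select are invented names, not taken from any real project): every line below is ordinary Scala, so the compiler, IDE, and debugger all keep working unchanged.

```scala
// A toy embedded DSL: query-like vocabulary built from plain Scala methods.
case class Row(fields: Map[String, Any])

class Table(val rows: Seq[Row]) {
  def where(p: Row => Boolean): Table = new Table(rows.filter(p))
  def select(cols: String*): Table =
    new Table(rows.map(r => Row(r.fields.filter { case (k, _) => cols.contains(k) })))
}

object DslExample {
  // Because where/select are plain methods, infix notation reads like a query,
  // yet this is valid Scala that type-checks like any other code:
  def adults(people: Table): Table =
    people where (_.fields("age").asInstanceOf[Int] >= 18) select ("name", "age")
}
```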

People have done lots of interesting things in this space:
  • Scalding (a Scala DSL built on Cascading, popular at Twitter)
  • ScalOps (from Tyson Condie and others, formerly at Yahoo! Research)
  • Spark (from the AMP Lab at Berkeley)
  • Delite and OptiML (from Stanford)

The SIGMOD demo I did this year for Clydesdale used a simple SQL-like DSL to express the star schema benchmark. I did this mostly to convince myself (and others) that using this DSL mechanism was simpler and easier than building a SQL parser and compiler from scratch. It certainly was. It also let us use Java libraries in our data processing without the inconveniences that UDFs (user-defined functions) cause in many databases.
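I won't reproduce Clydesdale's actual DSL syntax here, but a hypothetical sketch conveys the flavor: each relational step of a star-schema query (join the fact table to a dimension, filter, group, aggregate) is a plain Scala method call, so calling into a Java library is just another method call rather than a UDF registration. The LineOrder and Customer types below are illustrative.

```scala
// Illustrative star-schema fragments: a fact table and one dimension.
case class LineOrder(custKey: Int, revenue: Long) // fact
case class Customer(custKey: Int, region: String) // dimension

object StarQuery {
  def revenueByRegion(facts: Seq[LineOrder], dims: Seq[Customer]): Map[String, Long] = {
    val byKey = dims.map(c => c.custKey -> c.region).toMap
    facts
      .flatMap(f => byKey.get(f.custKey).map(region => region -> f.revenue)) // join
      .filter { case (region, _) => region == "ASIA" }                       // select
      .groupBy(_._1)                                                         // group by
      .map { case (region, rs) => region -> rs.map(_._2).sum }               // aggregate
  }
}
```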

Another project where I found this very useful was a DSL for analyzing DNA sequences and looking for statistically significant SNPs in large data sets. I haven't yet gotten around to writing that work up in detail, but it was a fun project that showed how a DSL that compiles down to the Spark runtime can handily beat an approach that combines Hadoop and R (such as RHIPE or Revolution R's R/Hadoop) and requires the developer to write code that spans the two systems in awkward ways.
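Since that work isn't written up yet, the following is only a toy sketch of the shape of such a job on Spark's RDD API (the input format, field positions, and the count threshold are all invented). The point it illustrates is that the whole allele-counting computation lives in one Scala program on one runtime, rather than spanning Hadoop jobs on one side and R scripts on the other.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Toy sketch: count alleles per genome position and keep positions where
// more than one allele is seen with reasonable support.
object SnpCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snp-count"))

    // Assumed input: one line per aligned base, "<position>\t<allele>"
    val bases = sc.textFile(args(0))
      .map(_.split("\t"))
      .map(a => (a(0).toLong, a(1)))

    // Count occurrences of each (position, allele) pair.
    val counts = bases
      .map { case (pos, allele) => ((pos, allele), 1L) }
      .reduceByKey(_ + _)

    // Keep positions with at least two alleles, each seen >= 10 times
    // (an arbitrary threshold for this sketch).
    val candidates = counts
      .map { case ((pos, allele), n) => (pos, (allele, n)) }
      .groupByKey()
      .filter { case (_, alleles) => alleles.size > 1 && alleles.map(_._2).min >= 10 }

    candidates.saveAsTextFile(args(1))
    sc.stop()
  }
}
```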

I expect we'll see more interesting DSLs that compile down to runtimes on Hadoop, MPI, Spark, Pregel/Giraph, and other frameworks in the next few years.