Friday, November 16, 2012

Cloudera's Impala

Now that the dust has settled on Cloudera's announcement of Impala, here are my notes. For some quick background, Monash had some early observations here and here. Marcel and Justin have an excellent blog post here.
  • Impala is a parallel query engine for data in HDFS or HBase (no support for inserts/updates/deletes, at least as of now)
  • Queries in Impala are written in HiveQL, so you might be able to take your Hive workload and move parts of it to Impala
  • Impala does not use MapReduce
  • Impala does not have UDFs or UDAs, and doesn't support all the OLAP functions and complex nested subqueries
  • Impala is expected to be between 3x and 50x faster than Hive
  • Impala does support joins (unlike Dremel), and will soon have columnar storage through Trevni, but has a row-oriented runtime.
  • Impala has ODBC connectors to MicroStrategy and Tableau
That said, here are some nice things about Impala that, IMHO, Monash missed:
  • Impala has a distinct advantage over something like Hadapt. Storage and replication are managed by HDFS instead of a proprietary subsystem that replicates the data that the Postgres nodes need to have available. One fewer thing to manage and administer.
  • Impala has a distinct advantage in terms of the cost of storage (for archival querying) over MPP vendors. HDFS is *way* cheaper than a bunch of SANs. I don't yet know how well the replication story in Aster, ParAccel, and Vertica will hold up against HDFS for cost, reliability, and flexibility.
  • If you use Impala for your query processing, you can continue to use Hadoop for scale-out ELT -- on the same cluster. This might be much harder with other MPP databases: that'll probably require two separate clusters administered independently and connected with a "fat pipe" to ETL data from the Hadoop cluster to the MPP cluster.
  • Impala costs $0 to try; you only pay for support/management tools :-)
Now what are the potential gotchas with Impala?
  • Since you have to have Impala daemons running on the cluster, where do you put them? Can you also run MapReduce on the nodes that run impalad?
  • Assuming you are running a large cluster with Impala jobs and MapReduce jobs, does the scheduler know how to assign resources across the two kinds of processes on the nodes?
  • If only certain nodes in the cluster are dedicated to be Impala nodes, you may not be able to guarantee that all the HDFS data is locally available to some Impala worker.
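To put a rough number on that last concern, here is a back-of-the-envelope sketch in Python. It is my own simplification, not anything from Cloudera: it assumes each block's replicas land on distinct nodes chosen uniformly at random, which ignores HDFS's rack-aware placement, and the function name and parameters are made up for illustration. Under those assumptions, it estimates the fraction of blocks that have at least one replica on a node running impalad.

```python
from math import comb


def fraction_with_local_replica(total_nodes: int,
                                impala_nodes: int,
                                replication: int = 3) -> float:
    """Estimate the share of HDFS blocks with a replica on some Impala node.

    Assumption (not how HDFS really places blocks): the replicas of a block
    sit on `replication` distinct nodes picked uniformly at random.
    """
    if not 0 <= impala_nodes <= total_nodes:
        raise ValueError("impala_nodes must be between 0 and total_nodes")
    if not 1 <= replication <= total_nodes:
        raise ValueError("replication must be between 1 and total_nodes")

    non_impala = total_nodes - impala_nodes
    # Ways to place every replica on non-Impala nodes, over all placements.
    # comb() returns 0 when replication > non_impala, so the result is 1.0
    # whenever there are too few non-Impala nodes to hold all replicas.
    all_remote = comb(non_impala, replication) / comb(total_nodes, replication)
    return 1.0 - all_remote


if __name__ == "__main__":
    # Hypothetical 100-node cluster with a growing Impala footprint.
    for k in (10, 20, 50, 100):
        share = fraction_with_local_replica(100, k)
        print(f"{k:3d} Impala nodes: {share:.1%} of blocks have a local replica")
```

Even with this crude model, the takeaway is that dedicating a small slice of the cluster to Impala leaves a large share of blocks with no local replica, so those scans would have to go over the network.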
I'm sure all these problems can be solved (and are probably already being tackled). It will be interesting to see what sorts of workloads Impala serves well, and how well it plays with the rest of the workloads on a Hadoop cluster. I'm excited that Impala ups the game for structured data processing on Hadoop -- an area that I think still has a long way to go.