Friday, November 16, 2012

Cloudera's Impala

Now that the dust has settled on Cloudera's announcement of Impala, here are my notes. For some quick background, Monash had some early observations here and here. Marcel and Justin have an excellent blog post here.
  • Impala is a parallel query engine for data in HDFS or HBase (no support for inserts/updates/deletes, at least as of now)
  • Queries in Impala are written in HiveQL, so you might be able to take your Hive workload and move parts of it to Impala (see the example after this list)
  • Impala does not use MapReduce
  • Impala does not have UDFs or UDAFs, and doesn't support all of Hive's OLAP functions or complex nested subqueries
  • Impala is expected to be between 3x and 50x faster than Hive
  • Impala does support joins (unlike Dremel), and will soon have columnar storage through Trevni, but has a row-oriented runtime.
  • Impala has ODBC connectors to MicroStrategy and Tableau
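Since Impala speaks HiveQL, a query written for Hive can often be pointed at Impala unchanged. Here's a minimal sketch; the page_views and users tables are made up for illustration:

    -- The same statement runs in both the Hive CLI and impala-shell;
    -- only the execution engine underneath changes. Note the join,
    -- which Impala supports, and the LIMIT, which early Impala
    -- requires whenever ORDER BY is used.
    SELECT u.country, COUNT(*) AS hits
    FROM page_views v
    JOIN users u ON (v.user_id = u.id)
    GROUP BY u.country
    ORDER BY hits DESC
    LIMIT 10;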
That said, here are some nice things about Impala that, IMHO, Monash missed:
  • Impala has a distinct advantage over something like Hadapt. The storage and replication is managed by HDFS instead of a proprietary subsystem that replicates the data that the Postgres nodes need to have available. One fewer thing to manage and administer.
  • Impala has a distinct advantage in terms of the cost of storage (for archival querying) over MPP vendors. HDFS is *way* cheaper than a bunch of SANs. I don't yet know how well the replication story in Aster, ParAccel, and Vertica will hold up against HDFS for cost, reliability, and flexibility.
  • If you use Impala for your query processing, you can continue to use Hadoop for scale-out ELT -- on the same cluster (a rough sketch follows this list). This might be much harder with other MPP databases: that would probably require two separate clusters, administered independently and connected with a "fat pipe" to ETL data from the Hadoop cluster to the MPP cluster.
  • Impala costs $0 to try; you only pay for support/management tools :-)
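To make the same-cluster point concrete, here's a rough sketch; the table names are hypothetical. The ELT step runs through Hive on MapReduce, and the later low-latency query runs through Impala against the same HDFS-resident table:

    -- ELT step, run via Hive (MapReduce) on the cluster:
    INSERT OVERWRITE TABLE clean_events
    SELECT user_id, event_type, ts
    FROM raw_events
    WHERE user_id IS NOT NULL;

    -- Interactive query, run via impala-shell on the same cluster
    -- (depending on the version, Impala may need a REFRESH first to
    -- see newly written data):
    SELECT event_type, COUNT(*) AS n
    FROM clean_events
    GROUP BY event_type;

Both engines read the same HDFS files through the shared Hive metastore, so there's no export/import hop between systems.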
Now what are the potential gotchas with Impala?
  • Since you have to have Impala daemons running on the cluster, where do you put them? Can you also run MapReduce on the nodes that run impalad?
  • Assuming you are running a large cluster with Impala jobs and MapReduce jobs, does the scheduler know how to assign resources across the two kinds of processes on the nodes?
  • If only certain nodes in the cluster are dedicated Impala nodes, you may not be able to guarantee that all the HDFS data is locally available to some Impala worker.
I'm sure all these problems can be solved (and are probably already being tackled). It will be interesting to see what sorts of workloads Impala serves well, and how well it plays with the rest of the workloads on a Hadoop cluster. I'm excited that Impala ups the game for structured data processing on Hadoop -- an area that I think still has a long way to go.

5 comments:

  1. Hi Sandeep,

    It's great to see your interest in the Impala project and thanks for putting this blog together. I agree with your assessment of Impala. I also thought I'd help answer the questions that you posted about Impala:

    * Since you have to have Impala daemons running on the cluster, where do you put them? Can you also run MapReduce on the nodes that run impalad?

    You should run an Impala daemon on every HDFS data node and/or HBase region server in the cluster, alongside MapReduce. During the beta there is no good way to run Impala and MapReduce at the same time, as they will compete for resources, but expect this story to improve as Impala moves to GA.


    * Assuming you are running a large cluster with Impala jobs and MapReduce jobs, does the scheduler know how to assign resources across the two kinds of processes on the nodes?

    For GA you will be able to set how many resources to provide to each service so that MapReduce and Impala can run on the same cluster. MapReduce will schedule within its resource allocation based on the schedulers it has today, and Impala will respect its resource allocation as well.


    * If only certain nodes in the cluster are dedicated Impala nodes, you may not be able to guarantee that all the HDFS data is locally available to some Impala worker.

    That is correct. For Impala to run at the expected performance, you will need an Impala daemon on each node. Remote reads will dramatically reduce performance and will increasingly bottleneck on the network as data sizes grow.


    Please keep us posted on any additional feedback or questions you have.

    Thanks,
    Justin Erickson
    Product Manager for Cloudera Impala

    1. Interesting post!

      @Justin: Does Cloudera have a plan to add UDF/UDAF/UDTF support to Impala? It seems like a major drawback if one wants to move from Hive to Impala ...

      Cheers

  2. Justin,

    Thanks for sharing these answers! I look forward to seeing these new features in GA.

    Sandeep

  3. This comment has been removed by the author.

  4. Just an update: Impala support for UDFs is available in Impala 1.2 and higher:
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_udf.html
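    For reference, the Impala 1.2+ syntax looks roughly like this; the library path, symbol name, and table below are placeholders, not a real deployment:

        -- Register a native (C++) UDF from a shared library on HDFS,
        -- then call it like a built-in function.
        CREATE FUNCTION my_lower(string) RETURNS string
        LOCATION '/user/impala/udfs/libudfsample.so'
        SYMBOL='MyLower';

        SELECT my_lower(url) FROM page_views LIMIT 5;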
