I'm very excited to see that IBM has finally announced BLU, an architecture-aware compressed columnar engine in DB2. My old friends at Almaden Research worked super hard on this project, and were waiting until the product was released before they could brag about the stellar performance results that BLU achieved internally. I'm looking forward to seeing the BLU papers finally getting out to the research community. Here's Curt Monash's summary of the product announcement.
Edit:
Here's a great video describing Blink from Guy @ Almaden.
Thursday, May 30, 2013
Sunday, May 26, 2013
Javascript Applications
In a previous blog post, I talked about how Javascript is already the language in which so many mobile and web applications are being built. It is not a huge stretch to expect at least some of the back-end code to move to Javascript as well. Node.js and backbone.js are already making this easier. What has amazed me recently is how much of the native desktop experience can be re-created in the browser.
Check out Epic/Mozilla's port of the Unreal 3D engine to HTML5 here. I remember a time growing up when my desktop was too slow to run Unreal, and today we can run it in the browser! Given this kind of performance, applications like photo editing or even lightweight video editing could be delivered over the web with snappy interfaces that don't require round-trips to the back-end for everything. Pushing some of the processing to a cluster of GPU-based servers in the cloud will make kinds of editing possible that were too demanding for an average desktop processor, and will likely enable new kinds of workflows and actions as well.
As for the less sexy back-end logic, having it be in Javascript, and running efficiently, will open up new possibilities. This isn't a new idea -- Netscape tried server-side Javascript in the mid-nineties and it didn't really take off. However, Google's V8 engine has made running Javascript so much more efficient that replacing a Python/Django or Ruby on Rails stack with Javascript seems entirely reasonable. Consider the kinds of things you could do with this: you could build a full web application in a single language (Javascript), and optimize it differently for a desktop browser, a tablet, or a phone. As your application evolves, you might find it easier to move some functionality back and forth between the server and the client. I expect we'll see application servers provide a good environment for Javascript apps, much like we have for Java (Tomcat, WebLogic, WebSphere, etc.).
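To give a flavor of what server-side Javascript looks like, here's a minimal Node.js sketch -- the route and port are made up for illustration:

```javascript
// A tiny Node.js server: the same language (and even the same JSON objects)
// used in the browser, now serving requests on the back end.
var http = require('http');

var server = http.createServer(function (req, res) {
  if (req.url === '/api/hello') {
    // JSON that browser-side Javascript can consume directly.
    res.writeHead(200, {'Content-Type': 'application/json'});
    res.end(JSON.stringify({message: 'hello from Javascript on the server'}));
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(8080);  // hypothetical port
```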
I expect we'll see ever more interesting and sophisticated apps delivered on the desktop browser, and watch them quickly flow down to browsers on tablets, and eventually phones.
Wednesday, May 22, 2013
Interesting Papers
Here's an assortment of interesting papers from some of my colleagues in the recommendation/machine learning space that I've recently read.
- BPR: Bayesian Personalized Ranking from Implicit Feedback, Rendle et al.
- Supercharging Recommender Systems using Taxonomies for Learning User Purchase Behavior, Kanagal et al., PVLDB 2012
- Latent factor models with additive and hierarchically-smoothed user preferences, Ahmed et al., WSDM 2013
- Matrix Approximation under Local Low-Rank Assumption, Joonseok Lee et al., ICLR
- Sibyl: A system for large scale supervised machine learning
Wednesday, January 23, 2013
Productivity
I've recently discovered three interesting resources that have been very helpful in improving my general productivity. I'm sharing them here.
Any.do
Any.do is a really well designed TODO app with a gesture-based UI that works on iOS, Android, and on Chrome on the desktop. I've used several TODO apps before, but this one is my favorite because even after several weeks, I'm still using it.
The UI is extremely low-friction. It has enough features to be useful without being too complicated. You can assign times/locations to TODOs, get reminders, postpone items, or attach notes to them. Everything is classified as Today/Tomorrow/Upcoming/Someday. It includes free syncing across all your devices. It even integrates cleverly with Gmail. And it makes cute noises when you finish items -- the same kind of reward that made Angry Birds so addictive!
The Power of Habit by Charles Duhigg
This book by Charles Duhigg is on the neuro-psychology of habits. Like most business-type books, this one is much longer than it needs to be. But it gives you a really useful way to reason about habits by describing the cue-routine-reward loop and how the brain stores habits in a different part responsible for "automatic" functions. The techniques here can make it a little bit easier to build new (good) habits. I've been able to build at least one good habit. I'm still trying to get rid of some bad ones I picked up in grad school (late-night snacking)!
HBR Ideacast
I started listening to Harvard Business Review's Ideacast a couple of years ago. HBR produces short (15-20 min) podcasts -- mini interviews with accomplished people (with a focus on business) on different topics -- and makes them available for free. Some podcasts will help you discover great books/ideas/tools, while others are merely moderately entertaining. I've gotten ideas on time management, improved how I communicate with managers, discovered some great books, gained a more sympathetic understanding of why large organizations act the way they do, and gotten better at dealing with meetings. Pretty good mileage from a podcast!
If you have an interesting productivity resource you'd like to share, leave a comment!
Wednesday, December 26, 2012
Sparkler: Large Scale Matrix Factorization Using SGD
Personalized recommendation is now a critical component of many content-based web services such as Netflix, YouTube, Pandora, and the various app stores. The techniques developed in these contexts are now being adapted for use in less obvious content recommendation settings such as your social newsfeed (Facebook, Twitter), the updates from your professional network (LinkedIn), or even figuring out which advertisements to show you (Google, Facebook, Yahoo, MSN).
There are many ways to solve this problem, but the technique that has been enjoying substantial success is matrix factorization. In particular, for large datasets, the matrix factorization problem is solved using a technique called Stochastic Gradient Descent (SGD). I've written about this before -- in particular, I've talked about some of the cool research on a distributed variant of SGD that was invented at IBM Almaden by Rainer Gemulla, Peter Haas, and Yannis Sismanis.
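For concreteness, here is what the core SGD update looks like for matrix factorization -- a plain, single-machine sketch in Javascript. This is the textbook update rule, not the distributed DSGD algorithm from the paper, and the learning rate and regularization constants are placeholders:

```javascript
// Factorize a ratings matrix into U (users x rank) and V (items x rank) by
// repeatedly applying the stochastic gradient update to each observed rating.
function factorize(ratings, numUsers, numItems, rank, epochs, lr, lambda) {
  var U = [], V = [];
  for (var i = 0; i < numUsers; i++) U.push(randomVector(rank));
  for (var j = 0; j < numItems; j++) V.push(randomVector(rank));

  for (var e = 0; e < epochs; e++) {
    // One epoch = one pass over the observed ratings (SGD is iterative).
    ratings.forEach(function (r) {        // r = {user: i, item: j, value: x}
      var u = U[r.user], v = V[r.item];
      var err = r.value - dot(u, v);      // prediction error for this rating
      for (var k = 0; k < rank; k++) {
        var uk = u[k];
        u[k] += lr * (err * v[k] - lambda * u[k]);   // gradient step on U
        v[k] += lr * (err * uk   - lambda * v[k]);   // gradient step on V
      }
    });
  }
  return {U: U, V: V};
}

function randomVector(n) {
  var v = [];
  for (var i = 0; i < n; i++) v.push(Math.random() * 0.1);
  return v;
}

function dot(a, b) {
  var s = 0;
  for (var i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```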
Solving the matrix factorization problem on very large matrices (with millions of users and potentially millions of items) is a hard problem that has many data management aspects to it. If the items you are trying to recommend are URLs on the web, you may be stuck with tens or hundreds of millions of items. Since there is a large data management aspect to the problem, the natural question is: "Can you use Hadoop to solve this?" Well, implementing SGD-like algorithms on Hadoop poses a major challenge. SGD is an iterative algorithm that makes many passes over the input data, and Hadoop is well known to be inefficient for iterative workloads. In fact, many research papers have been written pointing this out, and systems like HaLoop have been proposed to address these shortcomings.
One of my favorite platforms for iterative workloads is the Spark project out of Berkeley's AMP lab. Spark is a parallel programming framework in Scala that supports efficient iterative algorithms on datasets stored in the aggregate memory of a cluster. This is a great fit for SGD-style algorithms. With help from a summer intern, we tried prototyping a few DSGD algorithms on Spark on a large dataset. Much to our surprise, we found that Spark didn't perform nearly as well as we expected it to. In the figure below, the line titled "Broadcast" (using Spark with broadcast variables to hold the factors) performs slower and slower as the rank of the factor matrices is increased from 25 up to 400.
Spark's programming model (mutable accumulators and broadcast variables, immutable RDDs) requires the programmer to assume that the factor matrices will fit in the memory of a single node. For large datasets, this is not always practical: as the input matrix gets larger, so does the size of the factors. For example, for 100 million customers, computing a factorization of rank 200 requires storing 200 x 100 million = 20 billion floating point numbers for the customer factor alone -- that amounts to 80GB of data. Such a large data structure cannot be easily accommodated in the main memory of a commodity node today. This is especially true in the cloud, where it is substantially easier to get a cluster of virtual machines whose aggregate memory far exceeds 80GB than to get a small number of virtual machines, each with 80GB of memory. Even if this data structure is suitably partitioned, in DSGD the cost of moving different partitions of the factors to the appropriate nodes using Spark's standard abstractions starts to dominate the overall time taken to factorize the matrix.
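As a quick back-of-the-envelope check of that number (assuming 4-byte single-precision floats):

```javascript
// Rough size of the customer-side factor matrix from the example above.
var users = 100 * 1000 * 1000;        // 100 million customers
var rank = 200;                       // factorization rank
var entries = users * rank;           // 20 billion floats
var gigabytes = entries * 4 / 1e9;    // 4 bytes per float -> 80 GB
console.log(entries + ' floats, ~' + gigabytes + ' GB');
```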
In a paper that just got accepted at EDBT 2013, we describe a solution to this problem. We built Sparkler, an extension to the Spark platform that makes it easier to solve large scale recommendation problems using DSGD. The main idea is the introduction of a simple distributed memory abstraction called a Carousel Map (CM) that a programmer is expected to use to hold the factor matrices during DSGD. CMs complement Spark's built-in abstractions like broadcast variables and accumulators, which are designed for small mutable models. CMs provide a map API for handling large factors in the aggregate memory of the cluster -- with CMs, we no longer require that factor matrices fit in the main memory of a single node. CMs are carefully designed to exploit the access patterns for the factors in a DSGD algorithm so that most of the time the lookups and gradient updates go to local memory. When a remote data item is requested, the data is arranged so that the cells likely to be accessed in the near future are bulk-transferred to the local node. The details are in the paper, which I'll link to as soon as the camera-ready version is out. In an experimental comparison on various factorization tasks, Sparkler with CMs provided a 4x to 21x improvement in performance over plain Spark.
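The paper has the actual design; purely to make the shape of such an abstraction concrete, here is a toy, single-process sketch of a map over factor rows that is split into blocks and bulk-fetches a whole block whenever a "remote" row is touched. The names and structure here are mine, not Sparkler's:

```javascript
// Toy illustration: factor rows are grouped into blocks; touching a row whose
// block is not held locally pulls in the whole block, so that subsequent
// lookups of nearby rows (the common case in DSGD's access pattern) are local.
function BlockedFactorMap(blockSize, fetchBlock) {
  this.blockSize = blockSize;
  this.fetchBlock = fetchBlock;   // callback returning the array of rows in a block
  this.localBlocks = {};          // blockId -> array of factor rows held locally
}

BlockedFactorMap.prototype.ensureLocal = function (row) {
  var blockId = Math.floor(row / this.blockSize);
  if (!(blockId in this.localBlocks)) {
    // Simulated "remote" access: bulk-transfer the entire block.
    this.localBlocks[blockId] = this.fetchBlock(blockId);
  }
  return this.localBlocks[blockId];
};

BlockedFactorMap.prototype.get = function (row) {
  return this.ensureLocal(row)[row % this.blockSize];
};

BlockedFactorMap.prototype.set = function (row, value) {
  this.ensureLocal(row)[row % this.blockSize] = value;
};

// Hypothetical use while processing one stratum on a worker:
//   var userFactors = new BlockedFactorMap(100000, fetchUserBlockFromPeer);
//   var u = userFactors.get(userId);  // local after the first touch of its block
```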
We also added a few other goodies to the Sparkler platform -- automatically picking a good layout for the data (including doing stratification, a key step in parallelizing DSGD), which helps minimize data movement during factorization. We also automatically pick an appropriate number of partitions to trade off partitioning overhead against the benefits of parallelization, and we even laid the groundwork for automating fault-tolerance decisions for CMs.
Friday, November 16, 2012
Cloudera's Impala
Now that the dust has settled on Cloudera's announcement of Impala, here are my notes. For some quick background, Monash had some early observations here and here. Marcel and Justin have an excellent blog post here.
- Impala is a parallel query engine for data in HDFS or HBase (no support for inserts/updates/deletes, at least as of now)
- Queries in Impala are written in HiveQL, so you might be able to take your Hive workload and move parts of it to Impala
- Impala does not use MapReduce
- Impala does not have UDFs or UDAs, and doesn't support all the OLAP functions and complex nested subqueries
- Impala is expected to be between 3x and 50x faster than Hive
- Impala does support joins (unlike Dremel), and will soon have columnar storage through Trevni, but has a row-oriented runtime.
- Impala has ODBC connectors to MicroStrategy and Tableau
That said, here are some nice things about Impala that, IMHO, Monash missed:
- Impala has a distinct advantage over something like Hadapt: the storage and replication are managed by HDFS instead of a proprietary subsystem that replicates the data the Postgres nodes need to have available. One fewer thing to manage and administer.
- Impala has a distinct advantage in terms of the cost of storage (for archival querying) over MPP vendors. HDFS is *way* cheaper than a bunch of SANs. I don't yet know how well the replication story in Aster, ParAccel, Vertica will hold up against HDFS for cost, reliability, and flexibility.
- If you use Impala for your query processing, you can continue to use Hadoop for scale-out ELT -- on the same cluster. This might be much harder with other MPP databases: that'll probably require two separate clusters, administered independently and connected with a "fat pipe" to ETL data from the Hadoop cluster to the MPP cluster.
- Impala costs $0 to try; you only pay for support/management tools :-)
Now what are the potential gotchas with Impala?
- Since you have to have Impala daemons running on the cluster, where do you put them? Can you also run MapReduce on the nodes that run impalad?
- Assuming you are running a large cluster with Impala jobs and MapReduce jobs, does the scheduler know how to assign resources across the two kinds of processes on the nodes?
- If only certain nodes in the cluster are dedicated as Impala nodes, you may not be able to guarantee that all the HDFS data is locally available to some Impala worker.
I'm sure all these problems can be solved (and are probably already being tackled). It will be interesting to see what sorts of workloads Impala serves well, and how well it plays with the rest of the workloads on a Hadoop cluster. I'm excited that Impala ups the game for structured data processing on Hadoop -- an area that I think still has a long way to go.
Monday, October 8, 2012
MongoDB: Why Database folks have it wrong.
It is easy for relational database folks to dismiss MongoDB as "silly". Research veterans have poked fun at it, there is a great video on YouTube ("MongoDB is Web Scale") that cracked me up with its criticism of Mongo fan-boys, and a recent VLDB paper from Wisconsin compared MongoDB and sharded SQL Server on the YCSB benchmark, showing that Mongo's performance falls way behind.
I think all of this misses the point. IMHO, MongoDB is not trying to be a more scalable, higher performance database for application developers who couldn't get this from sharded MySQL or any of the other commercial offerings. To understand why Mongo is so interesting, you have to look at the entire application programming stack, and how it is different from the previous generation. The standard 3-tier architecture of database, app/web-servers, browsers is getting rebuilt using very different technologies, driven by the enormous amounts of effort pouring into building applications/sites for mobile devices.
A few years ago, building a Web 2.0 application would require: 1) a team that knew Javascript/AJAX stuff really well, 2) a few guys who knew how to write server-side scripting logic in PHP, Ruby, or Python, and 3) basic MySQL skills, because you could use the ORM in the app language (Rails, Django, etc.) to do most of the talking to the DB. As more and more applications are getting built by highly skilled web developers, smaller shops may find that being able to build a larger part of their application in a single language (say, Javascript!) helps them get the job done with fewer people than this traditional division of skills.
Even though Javascript is a less-than-ideal language for developing complex applications, there are many pieces of technology enabling this push. Technologies like node.js provide server-side scripting in Javascript. Given how much effort has gone into engineering V8, it is no surprise that some people, such as the mobile app folks at LinkedIn, are beginning to notice that node.js may actually provide higher throughput than Ruby on Rails. Now the same team of web developers building browser code can also build a big chunk of the application logic in Javascript. Technologies like backbone.js provide models with bindings to key-value stores and rich collection APIs to query and process them. This is making it easier to build complex applications in Javascript.
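As a small illustration of the kind of client-side code backbone.js enables -- assuming Backbone (and its dependencies) are loaded on the page; the Todo model and the /api/todos endpoint are made-up examples:

```javascript
// A model and a collection that sync themselves against a REST endpoint,
// so the browser-side code works with plain Javascript objects throughout.
var Todo = Backbone.Model.extend({
  defaults: {title: '', done: false}
});

var TodoList = Backbone.Collection.extend({
  model: Todo,
  url: '/api/todos',   // hypothetical endpoint the collection syncs against
  remaining: function () {
    return this.filter(function (todo) { return !todo.get('done'); });
  }
});

var todos = new TodoList();
todos.fetch();          // GET /api/todos and populate the collection
```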
Now, can we provide a persistence layer here that is extremely agile and flexible, and that can be accessed from code in a Javascript VM? I think MongoDB is trying to be that persistence layer. Relational databases still offer too much friction for flex-schema applications written in Javascript. A database that does a really good job of storing JSON is likely to be more successful. There are certainly many ways to talk to a relational database from a Javascript VM: node.js has ORMs that work with several databases, as described here. For high-performance websites built using Javascript for both client-side and server-side scripting, talking to relational DBs through ORMs in JS is probably the stable answer for now. But as MongoDB or a similar JSON store matures, the friction that ORM+RDBMS involves may start to outweigh the performance benefits of RDBMSes.
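For instance, here's what storing and querying nested JSON looks like straight from Javascript, using MongoDB shell syntax -- no ORM, no relational schema; the collection and fields are made up for illustration:

```javascript
// Insert a JSON document as-is: nested objects and arrays need no mapping layer.
db.profiles.insert({
  user: 'alice',
  interests: ['databases', 'recommender systems'],
  devices: [{type: 'phone', os: 'Android'}, {type: 'tablet', os: 'iOS'}]
});

// Flex-schema in action: other documents in the same collection can have
// different fields, and nested fields are queryable with dot notation.
db.profiles.find({'devices.os': 'Android'});
```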
If all three tiers of the stack can now be programmed in Javascript, this opens up very interesting new questions: with the right compiler technologies, we can fluidly determine the boundary between client-side logic and server-side logic as the application and the client hardware (phones, tablets) evolve! Deploying the app on a CPU-poor device? Push more of the processing logic to the server side. Is the app being used on a desktop? Then push more of the processing to the client side. Having everything in Javascript is going to make it much easier to build applications like this!