Tuesday, September 20, 2011

Parallel Stochastic Gradient Descent

I've been learning about stochastic gradient descent (SGD) and why people seem to be excited about it in the context of building models over large datasets. Rainer Gemulla, Peter Haas, Erik Nijcamp, and Yannis Sismanis published a really cool paper at KDD this year that was my introduction to the problem.

At a high level. SGD is simply an optimization technique that can be used to minimize some objective function (such as a loss function). The main difference from traditional "batch" gradient descent is that SGD samples from the data at each step. That is, the model can be updated after evaluating the gradient of the loss function for a sample of the input data -- a sample that can be as small as a single point! SGD can achieve incredibly fast performance on a variety of machine learning tasks -- and often converges to a good solution much faster than conventional batch gradient descent. On large data sets, the performance difference could be an order of magnitude or more. There are, of course, lots of caveats about data size, learning rates, sampling techniques, and convergence rates, when it comes to SGD.

The KDD paper focuses on matrix factorization -- a technique that has received a lot of interest in the context of the "Netflix Problem". Given a sparse C by I matrix M of C customers and I  items (say movies), where each entry in the matrix contains a rating, the problem is one of finding factors of the matrix A , B such that the matrix M' = A x B matches the existing values in M, and offers a prediction for the missing values in M. There are many techniques for solving this problem. However, as the matrix M gets larger, and you have billions of ratings from millions of customers, the problem becomes too big to solve on a single machine.

Rainer's KDD paper shows how you can use a distributed version of SGD to solve this problem using a cluster of machines. The central idea in the paper is a clever partitioning of the matrix so that each node in the cluster can work on a partition of the data and update the model (the factors A and B) without fine grained coordination with other nodes in the cluster. The bulk of the paper deals with proving why their partitioning strategy is statistically sound. The paper describes an implementation of this using R and Snowfall for small clusters. They also describe an implementation of this algorithm using Hadoop. The details of the implementation are tricky and complicated -- you need to achieve a certain partitioning of the data,  there are certain communication patterns on the model that are not a natural fit for Hadoop resulting in some inelegant MapReduce code, and a certain amount of additional coordination is required for efficient performance. However, the results are impressive -- the Distributed SGD algorithm converges to a good solution faster and scales way better than alternative techniques.

As one can imagine, Netflix isn't the only company that needs to solve large recommendation problems. There are many large content publishers, marketplaces, and retailers working hard on good solutions to this problem. What's more, the ability to factorize large matrices is good not just for content recommendation. It can be used as a building block for clustering algorithms, topic detection, and even risk analytics. Neat stuff. This should enable some interesting new recommendation-style applications and provide alternate/faster implementations for existing ones.


  1. This comment has been removed by the author.

  2. Hi Sandeep,
    I have one doubt ,Parallel Versions of batch Learning algorithms are available ,then why do we need to use parallel stochastic gradient descent?