
Wednesday, December 26, 2012

Sparkler: Large Scale Matrix Factorization Using SGD

Personalized recommendation is now a critical component of many content-based web services such as Netflix, YouTube, Pandora, and the various app stores. The techniques developed in these contexts are now being adapted for less obvious recommendation settings such as your social newsfeed (Facebook, Twitter), the updates from your professional network (LinkedIn), and even figuring out which advertisements to show you (Google, Facebook, Yahoo, MSN).

There are many ways to solve this problem, but the technique that has been enjoying substantial success is matrix factorization. In particular, for large datasets, the matrix factorization problem is solved using a technique called Stochastic Gradient Descent (SGD). I've written about this before -- in particular, I've talked about some of the cool research on a distributed variant of SGD that was invented at IBM Almaden by Rainer Gemulla, Peter Haas, and Yannis Sismanis.

Solving the matrix factorization problem on very large matrices (with millions of users and potentially millions of items) is a hard problem with many data management aspects. If the items you are trying to recommend are URLs on the web, you may be stuck with tens or hundreds of millions of items. Given the large data management aspect of the problem, a natural question is: "Can you use Hadoop to solve this?" Unfortunately, implementing SGD-like algorithms on Hadoop poses a major challenge. SGD is an iterative algorithm that makes many passes over the input data, and Hadoop is well known to be inefficient for iterative workloads. In fact, many research papers have pointed this out, and systems like HaLoop have been proposed to address these shortcomings.

One of my favorite platforms for iterative workloads is the Spark project out of Berkeley's AMP lab. Spark is a parallel programming framework in Scala that supports efficient iterative algorithms on datasets stored in the aggregate memory of a cluster. This is a great fit for SGD-style algorithms. With help from a summer intern, we tried prototyping a few DSGD algorithms on Spark on a large dataset. Much to our surprise, we found that Spark didn't perform nearly as well as we expected it to. In the figure below, the line titled "Broadcast" (using Spark with broadcast variables to hold the factors) performs slower and slower as the rank of the factor matrices is increased from 25 up to 400.

Spark's programming model (mutable accumulators and broadcast variables, immutable RDDs) requires the programmer to assume that the factor matrices will fit in the memory of a single node. For large datasets this is not always practical: as the input matrix gets larger, so does the size of the factors. For example, for 100 million customers and a rank-200 factorization, one needs to store 200 x 100 million = 20 billion floating point numbers for the customer factor alone; at 4 bytes per float, that amounts to 80GB of data. Such a large data structure cannot easily be accommodated in the main memory of a commodity node today. This is especially true in the cloud, where it is substantially easier to get a cluster of virtual machines whose aggregate memory far exceeds 80GB than a small number of virtual machines, each with 80GB of memory. Even if this data structure is suitably partitioned, in DSGD the cost of moving different partitions of the factors to the appropriate nodes using Spark's standard abstractions starts to dominate the overall time taken to factorize the matrix.
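As a quick sanity check on those numbers, here is a back-of-the-envelope sketch (the 4-bytes-per-float assumption is mine):

```python
# Rough sizing of the customer factor matrix from the example above.
num_customers = 100_000_000      # 100 million customers
rank = 200                       # rank of the factorization
bytes_per_float = 4              # assuming single-precision floats

entries = num_customers * rank               # 20 billion values
gigabytes = entries * bytes_per_float / 1e9  # ~80 GB
print(f"{entries:,} floats ~= {gigabytes:.0f} GB")
```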

In a paper that was just accepted at EDBT 2013, we describe a solution to this problem. We built Sparkler, an extension to the Spark platform that makes it easier to solve large scale recommendation problems using DSGD. The main idea is a simple distributed memory abstraction called a Carousel Map (CM) that the programmer uses to hold the factor matrices during DSGD. CMs complement Spark's built-in abstractions like broadcast variables and accumulators, which are designed for small models. CMs provide a map API for handling large factors in the aggregate memory of the cluster: with CMs, the factor matrices no longer need to fit in the main memory of a single node. CMs are carefully designed to exploit the access patterns of the factors in a DSGD algorithm, so that most of the time the lookups and gradient updates go to local memory. When a remote data item is requested, the data is arranged so that the cells likely to be accessed in the near future are bulk-transferred to the local node. The details are in the paper, which I'll link to as soon as the camera-ready version is out. In an experimental comparison on various factorization tasks, Sparkler with CMs provided a 4x to 21x improvement in performance over plain Spark.
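To make the idea a bit more concrete, here is a hypothetical single-process sketch of the kind of map interface a DSGD implementation might program against (this is my own illustration, not the actual Carousel Map API): factor rows are grouped into blocks, lookups against locally cached blocks are cheap, and a miss pulls in a whole block so that rows likely to be needed soon arrive together.

```python
import numpy as np

class BlockedFactorMap:
    """Toy stand-in for a distributed factor map (illustration only)."""

    def __init__(self, rank, block_size=1024, fetch_block=None):
        self.rank = rank
        self.block_size = block_size
        # In a real system this would be a remote bulk transfer; here we
        # just materialize a freshly initialized block on demand.
        self.fetch_block = fetch_block or (
            lambda b: 0.01 * np.random.randn(block_size, rank))
        self.cache = {}          # block id -> locally held rows

    def _block(self, row):
        b = row // self.block_size
        if b not in self.cache:                  # "remote" miss
            self.cache[b] = self.fetch_block(b)  # bulk-transfer the block
        return self.cache[b], row % self.block_size

    def get(self, row):
        block, offset = self._block(row)
        return block[offset]

    def add(self, row, delta):                   # in-place gradient update
        block, offset = self._block(row)
        block[offset] += delta
```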

We also added a few other goodies to the Sparkler platform: automatically picking a good layout for the data (including stratification, a key step in parallelizing DSGD), which helps minimize data movement during factorization. We also automatically pick an appropriate number of partitions to trade off partitioning overhead against the benefits of parallelization, and we laid the groundwork for automating fault-tolerance decisions for CMs.

Tuesday, October 11, 2011

Data Management and Machine Learning

Victor, Hogwild, Felix, Tuffy: Chris Re has been doing some very interesting work at the intersection of databases and machine learning at the University of Wisconsin. Victor proposes to support many kinds of learning tasks that can be solved using Stochastic Gradient Descent (SGD) inside a traditional relational database. Victor provides a Python interface for specifying your learning task, which is then executed using the database's UDF mechanism. They show that it is not particularly hard to solve a variety of statistical problems with this infrastructure. As expected, linear models such as logistic regression are easy to implement; the examples go on to show conditional random fields, min-cut, and factored models.
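For a flavor of the simplest case mentioned above, here is a minimal logistic regression trained with SGD in plain NumPy (my own sketch, unrelated to Victor's actual Python interface or its UDF execution):

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=10, seed=0):
    """Train logistic regression with SGD; X is (n, d), y is 0/1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):        # one example per update
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - y[i]) * X[i]     # gradient of the log loss
    return w

# Tiny usage example on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
print(sgd_logistic_regression(X, y))
```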

Felix is a relational optimizer for statistical inference. In particular, if you want to solve a Markov Logic Network over data sitting in a database (say Postgres), Felix is your guy. Check out this example. In a VLDB paper this year they describe a system called Tuffy, one of the components that Felix uses. I haven't had to play with Bayesian networks or graphical models since my machine learning course back in grad school, so I can more easily see the applicability of regression and factored models compared with MLNs :-) Any pointers to applications where MLNs have been used successfully to solve big-data learning tasks would be much appreciated.

A recent paper (HOGWILD) presents some very interesting results on parallelizing SGD algorithms. SGD is an incredibly convenient way to deal with large datasets while learning statistical models. A particularly thorny issue in parallelizing an SGD algorithm is handling concurrent updates to a shared model from different threads. While standard techniques of locking and synchronization can ensure that updates to the model follow some serial order, they often end up being the critical performance bottleneck.

In HOGWILD, the authors show that for optimization problems where the model is sparse (that is, where concurrent updates have a low probability of conflicting), it is reasonable to proceed with the updates without locking or synchronization. While this may cause some threads (or processes) to occasionally overwrite each other's work, the convergence rate is still close to optimal. On several machine learning tasks, such as matrix factorization and learning sparse SVMs, they show that HOGWILD is at least as good as, and often substantially faster than, known serial methods and previous parallel techniques. This is a really cool result, and I've used it in a different context in my current research with impressive results! I'll describe that work in the coming weeks...
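A toy sketch of the lock-free update pattern (illustration only; in CPython the GIL serializes most of the work, and the actual HOGWILD implementation uses true shared-memory threads in C):

```python
import threading
import numpy as np

def hogwild_style_sgd(X, y, w, lr=0.05, n_threads=4, epochs=5):
    """Threads apply SGD updates to the shared vector w with no locks.

    For sparse rows of X, most updates touch disjoint coordinates, so
    the occasional lost update barely hurts convergence.
    """
    def worker(rows):
        for _ in range(epochs):
            for i in rows:
                xi = X[i]
                err = xi @ w - y[i]          # squared-loss residual
                nz = xi.nonzero()[0]         # only touch nonzero coords
                w[nz] -= lr * err * xi[nz]

    chunks = np.array_split(np.arange(len(y)), n_threads)
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```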

Tuesday, September 20, 2011

Parallel Stochastic Gradient Descent

I've been learning about stochastic gradient descent (SGD) and why people seem to be excited about it in the context of building models over large datasets. Rainer Gemulla, Peter Haas, Erik Nijkamp, and Yannis Sismanis published a really cool paper at KDD this year that was my introduction to the problem.

At a high level, SGD is simply an optimization technique that can be used to minimize an objective function (such as a loss function). The main difference from traditional "batch" gradient descent is that SGD samples from the data at each step: the model can be updated after evaluating the gradient of the loss function on a sample of the input data -- a sample that can be as small as a single point! SGD can achieve incredibly fast performance on a variety of machine learning tasks, and often converges to a good solution much faster than conventional batch gradient descent. On large datasets, the performance difference can be an order of magnitude or more. There are, of course, lots of caveats about data size, learning rates, sampling techniques, and convergence rates when it comes to SGD.
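A minimal illustration of the difference for least-squares regression (my own toy sketch, not from the paper): batch gradient descent touches every example in each step, while SGD updates after a single sampled example.

```python
import numpy as np

def batch_gd(X, y, lr=0.01, steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # uses the full dataset
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))            # a single sampled example
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w
```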

The KDD paper focuses on matrix factorization, a technique that has received a lot of interest in the context of the "Netflix Problem". Given a sparse C x I matrix M of ratings from C customers for I items (say, movies), the problem is to find factor matrices A and B such that M' = A x B matches the existing values in M and offers predictions for the missing values. There are many techniques for solving this problem. However, as the matrix M gets larger, and you have billions of ratings from millions of customers, the problem becomes too big to solve on a single machine.
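The per-rating SGD update at the heart of this approach, with squared loss and L2 regularization, looks roughly like this (a sketch; the learning-rate and regularization choices are illustrative, not the paper's):

```python
import numpy as np

def sgd_mf_step(A, B, c, i, rating, lr=0.01, reg=0.05):
    """One SGD update for customer c and item i.

    A is (num_customers, rank), B is (rank, num_items);
    the model predicts rating(c, i) ~= A[c] @ B[:, i].
    """
    err = rating - A[c] @ B[:, i]
    a_c = A[c].copy()                          # keep old row for B's update
    A[c] += lr * (err * B[:, i] - reg * A[c])
    B[:, i] += lr * (err * a_c - reg * B[:, i])
```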

Rainer's KDD paper shows how you can use a distributed version of SGD to solve this problem on a cluster of machines. The central idea is a clever partitioning of the matrix so that each node in the cluster can work on its partition of the data and update the model (the factors A and B) without fine-grained coordination with the other nodes. The bulk of the paper proves why their partitioning strategy is statistically sound. The paper describes an implementation using R and Snowfall for small clusters, as well as an implementation of the algorithm on Hadoop. The details of the Hadoop implementation are tricky: you need to achieve a particular partitioning of the data, the communication patterns on the model are not a natural fit for Hadoop (resulting in some inelegant MapReduce code), and a certain amount of additional coordination is required for efficient performance. However, the results are impressive -- the distributed SGD algorithm converges to a good solution faster and scales far better than alternative techniques.
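One simple way to form such a partitioning (a simplified sketch of the stratification idea, not the paper's code): split M into d x d blocks and, in each sub-epoch, process a "stratum" of d blocks that share no row blocks or column blocks, so the d workers update disjoint pieces of A and B.

```python
def strata(d):
    """Yield d strata; each stratum is a list of d (row_block, col_block)
    pairs sharing no row block and no column block, so the corresponding
    factor updates never conflict."""
    for shift in range(d):
        yield [(r, (r + shift) % d) for r in range(d)]

# For d = 3, the sub-epochs cover the blocks along shifted diagonals:
#   [(0, 0), (1, 1), (2, 2)]
#   [(0, 1), (1, 2), (2, 0)]
#   [(0, 2), (1, 0), (2, 1)]
for s in strata(3):
    print(s)
```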

As one can imagine, Netflix isn't the only company that needs to solve large recommendation problems. There are many large content publishers, marketplaces, and retailers working hard on good solutions to this problem. What's more, the ability to factorize large matrices is good not just for content recommendation. It can be used as a building block for clustering algorithms, topic detection, and even risk analytics. Neat stuff. This should enable some interesting new recommendation-style applications and provide alternate/faster implementations for existing ones.