Tuesday, October 11, 2011

Data Management and Machine Learning

Victor, Hogwild, Felix, Tuffy: Chris Re has been doing some very interesting work at the intersection of databases and machine learning at the University of Wisconsin. Victor proposes to support many kinds of learning tasks that can be solved using Stochastic Gradient Descent (SGD) in a traditional relational database. Victor provides a python interface to design your learning task and it is executed using the UDF mechanism available in the database. They show that it is not particularly hard to solve a variety of statistical problems with this infrastructure. As expected linear models, such as logistic regression can easily be implemented. The examples go on to show conditional random fields, min-cut, and factored models.

Felix is a relational optimizer for statistical inference. In particular, if you want to solve a Markov Logic Network based on data sitting in some database (say postgres), Felix is your guy. Check out this example. In a VLDB paper this year they describe a system called Tuffy -- one of the components that Felix uses. I haven't had to play with  Bayesian Networks or Graphical Models since my machine learning course back in grad school, so I can more easily see the applicability of regression and factored models compared with MLNs :-) Any pointers to applications where MLNs have been used successfully to solve big-data learning tasks will be much appreciated.

A recent paper (HOGWILD) points out some very interesting results on parallelizing SGD algorithms. SGD is an incredibly convenient way to deal with large data sets while trying to learn statistical models. A particularly thorny issue of parallelizing an SGD algorithm is one of concurrent updates to a shared model from different threads. While standard techniques of locking and synchronization can be used to make sure updates to the model follow some serial order, they often end up being the critical performance bottleneck.

In HOGWILD, the authors show that for optimization problems where the model is sparse, that is concurrent updates have a low probability of causing conflicting updates, it is reasonable to proceed with the updates without locking or synchronization. While this might cause some threads (or processes) to occasionally over-write each other's work, the convergence rate is still close to optimal. On several machine learning tasks, such as matrix factorization and learning sparse SVMs -- they show that HOGWILD is at least as as good and often substantially faster than known serial methods and previous parallel techniques. This is a really cool result, and I've used it in a different context in my current research with impressive results! I'll describe that work in the coming weeks...

Tuesday, October 4, 2011

Tenzing: SQL on MapReduce

Google described Tenzing, their implementation of SQL on MapReduce at VLDB this year. The paper explains that they built it because their data-warehouse was getting too expensive and was not able to keep pace with the demands the users made -- in terms of complex queries (SQL was always getting in the way), and performance (the ETL process was often a big bottleneck). Neither argument is particularly surprising, although there are examples of successful PB-sized warehouses in the industry.

One of the most interesting parts of the paper was a laundry list of things they had to fix in MapReduce to get better performance for SQL:
  • Workerpools: These are essentially long running processes that do various parts of the MapReduce job (map task, reduce task, job coordinator, etc). Having a pool of running processes makes latencies lower than they would be if you had to launch a binary for each task in the job. This is certainly the case with JVM launches in Hadoop. Hadoop gets part of this done with reusable JVMs. The tradeoff, of course, is that fault isolation becomes a messier proposition.
  • Streaming and In-Memory Chaining: Allows two MapReduce jobs to communicate without temping to disk (GFS). I wonder if this can be done easily with just some InputFormat/OutputFormat magic ... I suspect this is do-able with some thought. Memory-chaining allows a mapper and a reducer to be co-located in the same process. This is probably going to be a bit harder to do in Hadoop.
  • Sort Avoidance: This feature allowed you to tell MapReduce to shuffle, but not sort. I've seen the need for this in many applications. Again, makes perfect sense for Hadoop also.
  • Block Shuffle: For smaller rows, when sorting is not needed, the block shuffle reduces overheads in the shuffle phase. This is a performance opportunity opened up by sort avoidance.
The other interesting bit was the section on their execution engine. Their SQL-Sawzall implementation was slow because of all the deserialization and serialization costs associated with moving data in and out of Sawzall's type system. The second version used Dremel's SQL expression evaluation engine -- this was better than the Sawzall engine, but still had inefficiencies. The most promising version was the LLVM-based experimental engine that could operate natively on columnar data, and this, not surprisingly, provided the best performance.

What does this mean for the Apache Hadoop ecosystem? I'm guessing the biggest performance opportunities for Hive currently lie in making the execution engine more CPU efficient. In fact, an experimental branch of Hive that does this would probably be a really fun open-source project right now. Hadoop will likely incorporate some of the performance improvements described in Tenzing over time, and Hive should be able to ride those when they become available.