Thursday, July 31, 2014

Data management and Machine Learning -- Picking off Feature Engineering

Over the last few years, there have been many interesting efforts from the data management research community to support various kinds of machine learning workloads on large data sets. Projects like MLBase, MADLib, SystemML, and ScalOps on Hyracks, as well as systems like Spark and Mahout, have surfaced many interesting ideas.

I think a recent line of work in the database community that focuses on the "feature engineering" phase of a machine learning project is particularly promising. Brainwash, outlined at CIDR in 2013, and Columbus, described at SIGMOD this year, argue that the "feature engineering" pipeline that needs to be built to feed any particular learning algorithm tends to take up the majority of the effort in any machine learning project. In fact, I've heard many practitioners estimate that feature engineering and "pipeline work" consume about 90% of the resources (time, people) in a machine learning project. This includes getting all the features together in one place, gathering statistics about the features, possibly normalizing them, and making sure any transformations are applied correctly before handing the data over to the learning algorithm.
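
To make the "pipeline work" concrete, here is a minimal sketch in plain Python of two of the steps listed above: gathering per-feature statistics and normalizing. The function names (column_stats, standardize) and the toy data are my own for illustration; they do not come from Brainwash or Columbus.

```python
import math

def column_stats(rows):
    """Return a (mean, std) pair per feature column.

    rows is a list of equal-length lists of floats (one list per example).
    """
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n  # population variance
        stats.append((mean, math.sqrt(var)))
    return stats

def standardize(rows, stats):
    """Z-score each feature using precomputed stats.

    Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
    """
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in rows
    ]

# Toy data: the second feature is constant.
rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
stats = column_stats(rows)
scaled = standardize(rows, stats)
```

Keeping the statistics separate from the transformation is deliberate: the same stats computed on the training data can then be applied unchanged to new data, which is one of the places where transformations tend to get applied incorrectly.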

Columbus provides a library of functions in R to build up a feature selection pipeline. In return for expressing the feature selection workflow using the interface provided by this library, Columbus promises several optimizations by reasoning about 1) the user-specified error tolerance, 2) the sophistication of the learning task, and 3) the amount of reuse. In experiments, the naive execution strategy was more than an order of magnitude slower than the plan Columbus picked for the program. The optimizations exploit well-known opportunities -- view materialization from databases, sophisticated sampling of input data (constrained by target accuracy) to reduce overall work, and using QR factorization to solve repeated least-squares problems efficiently.
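
The QR idea is easy to illustrate. The sketch below is plain Python and is not Columbus's implementation; it only shows the general principle, under the assumption that the same feature matrix A is reused for several least-squares solves. The matrix is factored once as A = QR, and each later solve costs only a product with Q^T and a back-substitution. The function names and toy data are mine.

```python
import math

def qr_decompose(A):
    """Thin QR via modified Gram-Schmidt.

    A is a list of n rows with d columns and is assumed to have full
    column rank. Returns Q as a list of d orthonormal columns (each of
    length n) and R as a d x d upper-triangular matrix.
    """
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    Q = []
    R = [[0.0] * d for _ in range(d)]
    for j in range(d):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along the k-th orthonormal column.
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q.append([x / R[j][j] for x in v])
    return Q, R

def solve_least_squares(Q, R, y):
    """Minimize ||A w - y|| given A = QR by solving R w = Q^T y."""
    d = len(R)
    b = [sum(q * t for q, t in zip(Q[k], y)) for k in range(d)]
    w = [0.0] * d
    for i in range(d - 1, -1, -1):  # back-substitution
        s = b[i] - sum(R[i][j] * w[j] for j in range(i + 1, d))
        w[i] = s / R[i][i]
    return w

# Toy design matrix: an intercept column and one feature x = 0, 1, 2, 3.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Q, R = qr_decompose(A)                              # factor once
w1 = solve_least_squares(Q, R, [2.0, 5.0, 8.0, 11.0])   # y = 2 + 3x
w2 = solve_least_squares(Q, R, [1.0, 0.0, -1.0, -2.0])  # y = 1 - x
```

Here the expensive factorization happens once and both fits reuse it; a real system would use a numerically robust library routine (for example Householder-based QR) rather than hand-written Gram-Schmidt.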

There are some (non-technical) flaws in the paper -- while some enterprise feature-engineering workloads may look like what they outline, things look very different in internet companies that repeatedly solve learning problems on large datasets. (Besides, R is often not the tool of choice for large learning problems, so this is a bit of a random nit.) For ad-hoc machine learning tasks in enterprise settings, I can see this being a reasonable fit. Also, with Columbus, are you still writing your feature-engineering code in R if you're using the library functions provided? You have to pass in selection predicates as strings that are parsed and interpreted by Columbus -- you're now limited not by what you can do in R, but by what support is built into Columbus for feature engineering. Is this expressive enough to be broadly useful without having to hack up the R parser? (On a side note, a DSL-friendly language like Scala might offer some nice advantages here, but then you're no longer in an R console.)

I do think the choice of focusing on the feature-engineering workflow is brilliant, and perhaps the place where the data management community can add the most value. Broader "declarative framework for machine learning" approaches are riskier (even when they make the sensible choice of building a domain-specific language in a DSL-friendly host language) and may take much longer to have an impact. I'm excited to see where this line of research leads.

I'd recommend reading the Columbus paper. It won SIGMOD's best paper award and is a fun read. There are several ideas here that might be broadly useful to the more ambitious systems that are trying to support full-fledged machine learning workflows.