Wednesday, May 13, 2015

Google Cloud Bigtable: NoSQLaaS

I'm very excited that Google announced Cloud Bigtable. You get a managed instance of Bigtable with an open-source API (there's an HBase client) at a great price! Bigtable's HBase APIs are fairly simple and easy to use, and we seem to have mapped nearly all of the features in the client to Bigtable.

Combined with Google Cloud Dataflow, which lets you write Flume-style jobs over your data for analytics, I think Google's Cloud Platform is very compelling for data-intensive applications. I'm a huge fan of Flume, and am absolutely convinced it provides a huge boost to developer productivity compared to writing raw MapReduce jobs. The platform finally has all the pieces needed to build big data applications far quicker than rolling your own Hadoop+MapReduce+HBase+Hive/Pig/Scalding stack in the cloud. I'm excited to see how developers will use this platform.

Tuesday, April 28, 2015

On the dangers of AUC

In most applied ML projects I've been involved in, people rarely use AUC (the Area Under the ROC Curve) as a performance measure for a classifier. It is, however, very commonly reported in academic papers. On the surface, AUC has several nice properties: it is threshold-free, so you can compute it without having to pick a classification threshold; it is insensitive to changes in the ratio of positive to negative examples in the evaluation set; and a random classifier has an AUC of 0.5.

It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).

In this post, however, I'll describe the common scenario of training a classifier on a highly imbalanced problem (one where one class has many more examples than the other). We only discuss binary classification here, so there are just two classes. The AUC measure is, of course, insensitive to class imbalance, but it can produce rather misleading results. Consider a problem with a large number of negative examples and a small number of positive examples. Further, suppose a large fraction of the negative examples are “easy” -- that is, you can get most of them right without much effort. This is a fairly common situation in problems like spam or fraud detection. In such a situation, a classifier that gets the “easy” negatives right but offers random predictions for the rest actually does really well on AUC:

(The example below uses the AUC library in R.)
library(AUC)
n <- 5000
# For this scenario, we use 80% easy negatives (value 0); the remaining 20%
# are evenly split between hard negatives and positives (values -1 and 1).
d <- sample(c(numeric(8), -1, 1), n, replace=TRUE)

# Labels maps all negatives to 0, all positives to 1.
labels <- factor((d > 0) * 1)

# We predict all easy negatives correctly, and random predictions for the rest.
rand_preds <- runif(n) * abs(d)

# All easy negatives correct, better predictions for rest with some noise.
good_preds <- d*.1 + rand_preds

# All easy negatives correct. Better, less noisy predictions for the rest.
better_preds <- d*.3 + rand_preds
Now examine the AUC-ROC for these predictors:
> auc(roc(rand_preds, labels))
[1] 0.9430441
> auc(roc(good_preds, labels))
[1] 0.9628872
> auc(roc(better_preds, labels))
[1] 0.9896737


A couple of observations:
  1. While technically ‘rand_preds’ only gets the easy examples right, and is no better than random on the ones that matter, it produces a very high AUC of 0.94.
  2. Because of the large number of true negatives, the differences between the random, good, and better predictors are very small (of the order of 2%-3% change in AUC).

If you increase the imbalance to be 98% easy negatives, and 2% hard positives and negatives, the AUC differences are even smaller:
# 98% easy negatives:
d <- sample(c(numeric(98), -1, 1), n, replace=TRUE)
# Recompute labels, rand_preds, good_preds, and better_preds as above.

> auc(roc(rand_preds, labels))
[1] 0.995519
> auc(roc(good_preds, labels))
[1] 0.9974802
> auc(roc(better_preds, labels))
[1] 0.9996038

If your problem-specific metric is something like Precision@Recall=80%, these classifiers behave *very* differently:

# Plot Precision-Recall curves (using the ROCR library)
library(ROCR)
plot(performance(prediction(better_preds, labels), "prec", "rec"), col='green')
plot(performance(prediction(good_preds, labels), "prec", "rec"), col='blue', add=TRUE)
plot(performance(prediction(rand_preds, labels), "prec", "rec"), col='red', add=TRUE)


[Figure: prcurve.png -- Precision-Recall curves for the three classifiers]

Precision@Recall=80% is ~0.5 for the ‘random’ predictor (as expected), ~0.65 for the ‘good’ predictor, and ~0.9 for the ‘better’ predictor. These differences are huge compared to the AUC numbers, which differ only in the third significant digit.
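If you want this point metric without a plotting library, it's a one-function computation. Here's a minimal sketch, in Python rather than R purely for illustration; `precision_at_recall` is a made-up helper name, not a standard API:

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first ranking depth where recall hits the target.

    Sorts items by score (descending) and walks down the ranking,
    accumulating true positives, and reports precision as soon as
    recall reaches target_recall.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = 0
    for depth, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp / total_pos >= target_recall:
            return tp / depth
    return tp / len(ranked)

# Toy ranking: three positives, two of them scored near the top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(precision_at_recall(scores, labels, 0.8))  # -> 0.75
```

This matches what the Precision-Recall curve reads off at a fixed recall, without sweeping every threshold explicitly.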

Of course, everybody would recommend using a problem-specific measure ahead of a generic measure like AUC. The point of this post is that even if you're only using AUC for a quick-and-dirty comparison, you shouldn't see a high value on an imbalanced problem and be impressed with the performance.

Some interesting references:
Jesse Davis and Mark Goadrich. The relationship between Precision-Recall and ROC curves. ICML 2006.

D.J. Hand and C. Anagnostopoulos. When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance? Pattern Recognition Letters 34(5), April 2013, pp. 492-495.

David J. Hand. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77(1), October 2009, pp. 103-123.

Tuesday, February 10, 2015

WSDM 2015

I recently got back from WSDM which ended up being a very interesting conference with some great talks.

Keynotes

I enjoyed all three keynotes. Mike Franklin's talk focused on AMPLab's research progress and the steady stream of artifacts they've assembled into the Berkeley Data Analytics Stack (BDAS -- pronounced "bad-ass" :)). In addition to Spark, the stack now includes Spark SQL, GraphX, MLlib, Spark Streaming, and Velox. All of these projects pose interesting systems questions and have made a lot of progress on the kinds of tooling that will come in handy for large-scale data analysis beyond SQL queries.

Lada Adamic's talk was a fun visual tour through lots of interesting analyses of Facebook data that her team has been running. In particular, she's been using the data to understand what makes a particular piece of content rapidly gain a lot of popularity (or go "viral"). "Sharing" behavior is unique to social networks -- traditional web properties have spent lots of energy understanding what makes people click, but the "share" action is a completely different beast to study. Lada reported that they have had some success predicting whether a piece of content, once it reaches K users, will go on to reach 2K, and the prediction accuracy improves for larger K. Unfortunately, she reported that the features that predict virality mostly have to do with the speed at which the content is spreading, so we don't yet have a handle on how best to craft viral content.

Thorsten Joachims had a technical keynote that followed the arc of his previous work on learning to rank in the context of search. He talked about careful design of interventions in interactive processes to elicit information that can be suitably leveraged by a machine learning algorithm to improve ranking. He showed lots of examples, with evidence from experiments he ran on the arXiv search prototype that have since been reproduced by Yahoo, Baidu, and others.

Research Talks

Here's a subset of research talks that I thought were fun and interesting:

Thursday, January 22, 2015

The Tail at Scale

Another blog post recommending a paper to read ....

One of the challenges of building large systems on shared infrastructure like AWS and the Google Cloud is that you may have to think harder about dealing with variations in response time than you might if you were designing for dedicated hardware (like a traditional data warehouse).  This is obviously critical when you're building latency-critical services that power interactive experiences (websites, mobile apps, video/audio streaming services, etc.). Depending on the operating point, you're likely to see this effect even with batch processing systems like Map-Reduce and large scale machine learning systems.

A CACM article from Jeff Dean and Luiz André Barroso -- The Tail at Scale -- contains many valuable lessons on how to engineer systems to be robust to tail latencies. Some of the interesting ideas from the paper are to use:

  1. Hedged requests: The client sends a request to one replica, and after a short delay, sends a second request to another replica. As soon as one response is received, the other request is cancelled.
  2. Tied requests: The client sends a request to two replicas (also with a short delay) making sure the requests have metadata so that they know about each other. As soon as one replica starts processing a request, it sends a cancellation message to the other replica (this keeps the client out of the loop for the cancel-logic). Proper implementations of hedged requests and tied requests can significantly improve the 99th percentile latency with a negligible (2% - 3%) increase in the number of requests sent.
  3. Oversharding/Micro-Partitions: The portion of data to be served is divided into many more shards than there are machines to serve them. Each server is assigned several shards to manage. This way, load balancing can be managed in a fairly fine-grained manner by moving a small number of shards at a time from one server to another. This allows us better control in managing the variability in load across machines. Systems like Bigtable (and open-source implementations like HBase), for instance, have servers that each manage dozens of shards.
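To make the first idea concrete, here is a minimal sketch of a hedged request using Python threads. Everything in it -- `query_replica`, the simulated latencies, the 10 ms hedge delay -- is a made-up placeholder for illustration; a real implementation would also cancel the losing request, which plain Python threads can't do:

```python
import concurrent.futures
import time

def query_replica(replica_id, latency_s):
    # Stand-in for an RPC to a storage replica; latency_s simulates
    # how long that replica takes to respond.
    time.sleep(latency_s)
    return "response from replica %d" % replica_id

def hedged_request(hedge_delay_s=0.01):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        # Send the request to the primary replica (slow in this toy setup).
        primary = pool.submit(query_replica, 1, 0.05)
        try:
            # If the primary answers within the hedge delay, we're done.
            return primary.result(timeout=hedge_delay_s)
        except concurrent.futures.TimeoutError:
            # Primary is slow: hedge to a second (fast) replica and take
            # whichever of the two responses arrives first.
            backup = pool.submit(query_replica, 2, 0.001)
            done, _ = concurrent.futures.wait(
                [primary, backup],
                return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()

print(hedged_request())
```

In this toy setup the slow primary misses the hedge deadline, so the fast backup's response is returned; the 99th-percentile latency becomes roughly the hedge delay plus a fast replica's response time, at the cost of a small fraction of duplicated requests.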
There are lots of other ideas and interesting discussions in the paper. I strongly recommend it!

Tuesday, December 30, 2014

Machine Learning in Practice

A bunch of experienced Google engineers recently published a paper at a NIPS workshop titled "Machine Learning: The High Interest Credit Card of Technical Debt". This is a great paper that distills several dozen years' worth of experience in building and maintaining large-scale, complex, industrial-strength machine-learning systems. If you haven't seen this paper, I highly recommend it to get a sense for the patterns and anti-patterns they've observed, along with supporting anecdotes.

The perspective is very different from a typical  ML paper -- you're not going to see a new model or a clever optimization strategy. The focus is mostly on the places where you have to go beyond the generally understood good software engineering practices to make the job of operating a large-scale ML system manageable. It's a short and fun read. Even if you've run an industrial ML system for a few years, you'll likely learn a trick or two from this paper.

Thursday, July 31, 2014

Data management and Machine Learning -- Picking off Feature Engineering

Over the last few years, there have been many interesting efforts from the data management research community to support various kinds of machine learning workloads on large data sets. Projects like MLBase, MADlib, SystemML, and ScalOps on Hyracks, as well as systems like Spark and Mahout, have surfaced many interesting ideas.

I think a recent line of work in the database community that focuses on the "feature engineering" phase of a machine learning project is particularly promising. Brainwash, outlined at CIDR in 2013, and Columbus, described at SIGMOD this year, argue that the "feature engineering" pipeline that feeds any particular learning algorithm tends to take up the majority of the effort in a machine learning project. In fact, I've heard many practitioners estimate that feature engineering and "pipeline work" consume about 90% of the resources (time, people) in a machine learning project. This includes getting all the features together in one place, gathering statistics about the features, possibly normalizing them, and making sure any transformations are applied correctly before handing the data over to the learning algorithm.

Columbus  provides a library of functions in R to build up a feature selection pipeline. In return for expressing the feature selection workflow using the interface provided by this library, Columbus promises several optimizations by reasoning about 1) user-specified error tolerance, 2) sophistication of learning task, and 3) amount of reuse. In experiments, the naive execution strategy was more than an order of magnitude slower than the plan Columbus picked for the program. The optimizations exploit well-known opportunities -- view materialization from databases, sophisticated sampling of input data (constrained by target accuracy) to reduce overall work, and using QR factorization to solve repeated least-squares problems efficiently.

There are some (non-technical) flaws in the paper -- while some enterprise feature-engineering workloads may look like what they outline, things look very different at internet companies that repeatedly solve learning problems on large datasets. (Besides, R is often not the tool of choice for large learning problems, so this is a bit of a random nit.) For ad-hoc machine learning tasks in enterprise settings, I can see this being a reasonable fit. Also, with Columbus, are you still writing your feature-engineering code in R if you're using the library functions provided? You have to pass in selection predicates as strings parsed and interpreted by Columbus -- you're now limited not by what you can do in R, but by what support is built into Columbus for feature engineering. Is this expressive enough to be broadly useful without having to hack up the R parser? (On a side note, a DSL-friendly language like Scala might offer some nice advantages here, but then you're no longer in an R console.)

I do think the choice of focusing on the feature-engineering workflow is brilliant, and perhaps the place where the data management community can add the most value. Broader "declarative framework for machine learning" approaches are riskier (even when they make the sensible choice of doing a domain specific language in a DSL-friendly host-language) and may take much longer to have an impact. I'm excited to see where this line of research leads.

I'd recommend reading the Columbus paper. It won the SIGMOD best-paper award, and is a fun read. There are several ideas here that might be broadly useful to the more ambitious systems trying to support full-fledged machine learning workflows.

Wednesday, March 26, 2014

Recommendations: Why RMSE can be misleading

I've recently spent some time working on machine learning and recommendation engines. If you're building a recommendation engine, you're typically trying to optimize for some metric of "goodness" for the user. In a Netflix-like setting, it could be how much time a user spends watching the content you recommended. Picking good offline metrics (without actually watching how users respond) can be really tricky. The RMSE (Root Mean Square Error), a staple of many research papers, can be particularly misleading.

Assume we have 100 items that a user rates from 1 to 5 stars (much like in the Netflix problem). For simplicity, assume the first three items have 5-star ratings and the rest have 1-star ratings.

Product   True Rating   Algo. A Prediction   Algo. B Prediction
P001      5             2                    1
P002      5             2                    1
P003      5             2                    1
P004      1             2                    1
...       ...           2                    1
P100      1             2                    1

Consider Algorithm A, which predicts that all the ratings will be 2. The RMSE on this dataset is sqrt((27 + 97)/100) ≈ 1.11. Now consider Algorithm B, which predicts all ratings to be 1. Its RMSE is sqrt(48/100) ≈ 0.69. Algorithm B produces a huge improvement in RMSE over Algorithm A, but is it really any better at differentiating between the items the user liked and the ones she didn't? If you are going to use the predictions to solve a ranking problem, RMSE is a pretty useless measure in this context. A better metric would capture the fact that you're trying to use the recommendations to display the best items to the user, hoping that the user clicks/buys/watches/engages with/likes what you recommended. Being accurate on items way beyond the top few that the user is likely to engage with is not very useful at all.
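The arithmetic above is easy to check; this throwaway Python snippet reproduces both RMSE numbers:

```python
import math

# 3 items with a true rating of 5 stars, 97 with a true rating of 1 star.
true_ratings = [5] * 3 + [1] * 97

def rmse(preds, truth):
    # Root mean square error over all items.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth))
                     / len(truth))

print(rmse([2] * 100, true_ratings))  # Algorithm A: ~1.11
print(rmse([1] * 100, true_ratings))  # Algorithm B: ~0.69
```

Algorithm A pays a squared error of 9 on each of the three 5-star items and 1 on the 97 others; Algorithm B pays 16 on the three 5-star items and nothing elsewhere, so the 97 easy items dominate the metric.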

Now, on the other hand, suppose the "rating" we have is binary -- someone either "likes" the movie (1) or not (0). (In reality there's a third state, where someone watches a movie and then doesn't rate it; you can map this state to a 1 or 0 with a few application-specific assumptions.) With binary ratings and binary predictions, the squared error on each item is either 0 or 1, so the RMSE is just the square root of the fraction of predictions you got wrong. Because what we really have here is a classification problem, and not a ranking problem, RMSE ends up being more reasonable.

There are several papers on vastly superior metrics for ranking (that actually work in practice!); I'll try to describe them in future posts.