Wednesday, March 26, 2014

Recommendations: Why RMSE can be misleading

I've recently spent some time working on machine learning and recommendation engines. If you're building a recommendation engine, you're typically trying to optimize for some metric of "goodness" for the user. In a Netflix-like setting, it could be how much time does a user spend watching the content you recommended? Picking good offline metrics (without actually watching how the user is responding) can be really tricky. The RMSE (Root Mean Square Error), the staple of many research papers, can be particularly misleading in many cases.

Assume, we have 100 items that a user rates between 1 to 5 stars (much like in the Netflix problem). For simplicity, assume that the first three items have 5-star ratings, and the rest have a 1-star rating.

Product True Rating Algo. A Predictions Algo. B Predictions
P001 5 2 1
P002 5 2 1
P003 5 2 1
P004 1 2 1
... ...2 1
P100 1 2 1

Consider Algorithm A that predicts that all the ratings will be 2. The RMSE for this dataset = sqrt((97 + 27)/100) = 1.11. Now consider Algorithm B that predicts all ratings to be 1. The RMSE for this dataset is sqrt(48/100) = 0.693. Algorithm B produced a huge improvement in RMSE over algorithm A, but is it really any better at differentiating between the items that the user liked vs. ones that she didn't? If you are going to use the recommendations to solve a ranking problem, RMSE is a pretty useless measure in this context. A better metric would capture the fact that you're trying to use the recommendations to display the best items to the users, hoping that the user clicks/buys/watches/engages with/likes what you recommended. Being accurate on items way beyond the top few that the user is likely to engage with is not very useful at all.

Now on the other hand, if the "rating" we have is binary -- say someone "likes" the movie or not -- say 1 or 0. (In reality there's a third state, where someone watches a movie, and then doesn't rate it. You could can map this state to a 1 or 0 with a few application-specific assumptions). With a binary rating, the RMSE simply counts how many predictions you got right. Because what we really have here is a classification problem, and not a ranking problem, RMSE ends up being more reasonable.

There are several papers that talk about vastly superior metrics for ranking (that actually work in practice!) that I'll try and describe them in future posts.

1 comment:

  1. Algo B performs better than algo A in 97 percent cases with better accuracy. so why would you say RMSE was wrong in coming up with far less error for B than A.