Assume we have 100 items that a user rates from 1 to 5 stars (much like in the Netflix problem). For simplicity, assume that the first three items have 5-star ratings, and the rest have a 1-star rating.
Product | True Rating | Algo. A Predictions | Algo. B Predictions |
---|---|---|---|
P001 | 5 | 2 | 1 |
P002 | 5 | 2 | 1 |
P003 | 5 | 2 | 1 |
P004 | 1 | 2 | 1 |
... | ... | 2 | 1 |
P100 | 1 | 2 | 1 |
Consider Algorithm A, which predicts that every rating will be 2. The squared error is 1 for each of the 97 one-star items and 9 for each of the three five-star items, so the RMSE for this dataset = sqrt((97 + 27)/100) = sqrt(1.24) ≈ 1.11. Now consider Algorithm B, which predicts every rating to be 1. It is exactly right on the 97 one-star items and off by 4 on the other three, so its RMSE = sqrt(48/100) = sqrt(0.48) ≈ 0.69. Algorithm B produces a huge improvement in RMSE over Algorithm A, but is it really any better at differentiating between the items the user liked and the ones she didn't? If you are going to use the recommendations to solve a ranking problem, RMSE is a pretty useless measure. A better metric would capture the fact that you're trying to display the best items to the user, hoping that she clicks/buys/watches/engages with/likes what you recommended. Being accurate on items far beyond the top few that the user is likely to engage with is not very useful at all.
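Here's a minimal Python sketch that reproduces those two numbers on the toy dataset above (the `rmse` helper and variable names are just for illustration):

```python
import math

# Toy dataset from the table above: 3 items the user loved (5 stars),
# 97 items she didn't (1 star).
true_ratings = [5] * 3 + [1] * 97

def rmse(predictions, truths):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / len(truths))

algo_a = [2] * 100  # Algorithm A: predict 2 for everything
algo_b = [1] * 100  # Algorithm B: predict 1 for everything

print(rmse(algo_a, true_ratings))  # ~1.11
print(rmse(algo_b, true_ratings))  # ~0.69
```

Note that both algorithms predict a constant, so neither of them can rank the items at all; RMSE simply doesn't see that.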
Now, on the other hand, suppose the "rating" we have is binary -- say the user either "likes" the movie or not, 1 or 0. (In reality there's a third state, where someone watches a movie and then doesn't rate it; you can map this state to a 1 or 0 with a few application-specific assumptions.) With binary ratings and binary predictions, the squared error for each item is either 0 or 1, so RMSE simply reduces to the square root of the fraction of predictions you got wrong. Because what we really have here is a classification problem, not a ranking problem, RMSE ends up being a much more reasonable metric.
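A quick sketch of that equivalence, assuming the predictions themselves are also binary (the variable names are hypothetical):

```python
import math

# With binary likes (1/0) and binary predictions, each squared error is
# either 0 (correct) or 1 (wrong), so MSE is just the fraction of wrong
# predictions and RMSE is its square root.
true_likes  = [1, 1, 1] + [0] * 97   # same toy split as above, now as likes
predictions = [0] * 100              # e.g. predict "won't like" for everything

wrong = sum(p != t for p, t in zip(predictions, true_likes))
print(wrong / len(true_likes))             # misclassification rate: 0.03
print(math.sqrt(wrong / len(true_likes)))  # RMSE: ~0.17
```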
There are several papers on vastly superior metrics for ranking (that actually work in practice!), which I'll try to describe in future posts.
Algo B performs better than algo A in 97 percent of cases, with better accuracy. So why would you say RMSE was wrong in coming up with far less error for B than A?