I'm excited to share that my team's paper on classifying emails using embedded image contents has been accepted to WWW 2018 (aka The Web Conference 2018).
In this paper, we tackle the problem of extracting information from commercial emails promoting an offer to the user. The sheer number of these promotional emails makes it difficult for users to read all these emails and decide which ones are actually interesting and actionable. Extracting information enables an email platform to build several new experiences that can unlock the value in these emails without the user having to navigate and read all of them. For instance, we can highlight offers that are expiring soon, or display a notification when there’s an unexpired offer from a merchant if your phone recognizes that you are at that merchant’s store.
A key challenge in extracting information from such commercial emails is that they are often image-rich and contain very little text. Training a machine learning (ML) model on a rendered image-rich email and applying it to each incoming email can be prohibitively expensive. In this paper, we describe a cost-effective approach for extracting signals from both the text and image content of commercial emails in the context of a free email platform that serves over a billion users around the world. The key insight is to leverage the template structure of emails, and use off-the-shelf OCR techniques to obtain the text from images to augment the existing text features offline. Compared to a text-only approach, we show that we are able to identify 9.12% more email templates corresponding to ~5% more emails being identified as offers. Interestingly, our analysis shows that this 5% improvement in coverage is across the board, irrespective of whether the emails were sent by large merchants or small local merchants, allowing us to deliver an improved experience for everyone.