Saturday, February 3, 2018

ICDE 2018: Recommendations for All: Solving Thousands of Recommendation Problems Daily

I'm excited to share that our paper on scaling a recommendation service was accepted to ICDE 2018! An ML+Systems perspective.
Recommender systems are a key technology for many online services including e-commerce, movies, music, and news. Online retailers use product recommender systems to help users discover items that they may like. However, building a large-scale product recommender system is a challenging task. The problems of sparsity and cold-start are much more pronounced in this domain. Large online retailers have used good recommendations to drive user engagement and improve revenue, but the complexity involved is a roadblock to widespread adoption by smaller retailers.
In this paper, we tackle the problem of generating product recommendations for
tens of thousands of online retailers. Sigmund is an industrial-scale system for providing recommendations as a service. Sigmund was deployed to production in early 2014 and has been serving retailers every day. We describe the design choices that we made in order to train accurate matrix factorization models at minimal cost. We also share the lessons we learned from this experience -- both from a machine learning perspective and a systems perspective. We hope that these lessons are useful for building future machine-learning services.
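For readers curious what the core model looks like: a minimal matrix factorization trained with SGD can be sketched as below. This is a toy illustration only, not Sigmund's implementation; the function name, hyperparameters, and update rule are my own assumptions.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=4, lr=0.05, reg=0.01, epochs=300, seed=0):
    """Toy SGD matrix factorization.

    ratings: list of (user_index, item_index, value) triples.
    Returns user factors U (n_users x k) and item factors V (n_items x k);
    the predicted score for pair (u, i) is U[u] @ V[i].
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            # L2-regularized gradient step on both factor vectors.
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V
```

At serving time, recommendations for a user would be the items with the highest predicted scores among those the user has not already interacted with.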

Wednesday, January 17, 2018

WWW 2018: Hidden in Plain Sight – Classifying Emails Using Embedded Image Contents

I'm excited to share that my team's paper on classifying emails using embedded image contents has been accepted to WWW 2018 (aka The Web Conference 2018).

In this paper, we tackle the problem of extracting information from commercial emails promoting an offer to the user. The sheer number of these promotional emails makes it difficult for users to read them all and decide which ones are actually interesting and actionable. Extracting this information enables an email platform to build several new experiences that unlock the value in these emails without the user having to navigate and read all of them. For instance, we can highlight offers that are expiring soon, or display a notification when your phone recognizes that you are at a merchant's store and there is an unexpired offer from that merchant.

A key challenge in extracting information from such commercial emails is that they are often image-rich and contain very little text. Training a machine learning (ML) model on a rendered image-rich email and applying it to each incoming email can be prohibitively expensive. In this paper, we describe a cost-effective approach for extracting signals from both the text and image content of commercial emails in the context of a free email platform that serves over a billion users around the world. The key insight is to leverage the template structure of emails, and use off-the-shelf OCR techniques to obtain the text from images to augment the existing text features offline. Compared to a text-only approach, we show that we are able to identify 9.12% more email templates corresponding to ~5% more emails being identified as offers. Interestingly, our analysis shows that this 5% improvement in coverage is across the board, irrespective of whether the emails were sent by large merchants or small local merchants, allowing us to deliver an improved experience for everyone.
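The offline augmentation step can be sketched roughly as follows. This is a simplified illustration, not the production pipeline: the template representation, the bag-of-words feature choice, and the `run_ocr` callback (which would wrap an off-the-shelf OCR engine) are all assumptions of mine.

```python
from collections import Counter

def augment_template_features(template, run_ocr):
    """Combine a template's native text with text OCR'd from its images.

    template: dict with 'text' (str) and 'images' (list of image paths).
    run_ocr: callable mapping an image path to its extracted text, e.g.
             a wrapper around an off-the-shelf OCR engine.
    Returns simple bag-of-words features over the augmented text.
    """
    ocr_text = " ".join(run_ocr(path) for path in template["images"])
    tokens = (template["text"] + " " + ocr_text).lower().split()
    return Counter(tokens)
```

Because emails sharing a template also share their image content, the OCR cost is paid once per template offline rather than once per incoming email.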

Thursday, July 6, 2017

KDD 2017 Teaser Video and Paper

Check out the teaser video for our paper at KDD 2017!  The PDF of the paper is available here at the Google Research web site.

Tuesday, May 30, 2017

Drive Quick Access at KDD 2017

Our paper, "Quick Access: Building a Smart Experience for Google Drive" was accepted for presentation at KDD 2017. The development of the Quick Access feature illustrates many general challenges and constraints associated with practical machine learning such as protecting user privacy, working with data services that are not designed with machine-learning in mind, and evolving product definitions. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine-learning problems. We're looking forward to sharing this with the KDD community.

Friday, March 10, 2017

Quick Access on Drive: How Machine Learning can save you time

Check out my blog post on the Google Research blog about how ML saves you time with Quick Access on Drive.
Quick Access on Drive (Web and iOS)

Thursday, September 29, 2016

Quick Access: Machine Learning in Google Drive

I'm very excited about a new feature in Drive that our team in Research/Machine Intelligence helped launch today: Quick Access for Drive.

Quick Access uses machine learning over your Drive activity and your daily interactions to surface the documents you need before you search or navigate to them. Our user research shows that it shaves 50 percent off the average time it takes to get to the right file. Quick Access is available for G Suite customers on Android. If you're interested in all the new intelligent features in Google Apps, check out this post from Prabhakar Raghavan, the VP for Google Apps.

Thursday, August 20, 2015

UDF Centric Workflows

A recent PVLDB paper titled "An Architecture for Compiling UDF-Centric Workflows" describes some pretty exciting advances in the state of the art for data analysis. Using a system called Tupleware, the authors of the paper show that for datasets that fit into the memory of a small cluster, it is possible to extract far more performance from the hardware than is commonly provided on platforms like Spark. For instance, on a K-Means task, Tupleware was faster than Spark by 5X on a single node, and by nearly 100X on a 10-node cluster. For small datasets, that probably means a 1000x (3 orders of magnitude) improvement over Hadoop.

The dramatic performance improvement comes from leveraging several known techniques from the literature on main-memory database systems and compilers. Tupleware combines the high-level optimizations common in relational optimizers (projection and selection pushdown, join order optimization) with low-level optimizations considered by recent main-memory systems, such as pipelining operators versus vectorizing simple operators. The paper describes several techniques and heuristics that together produce this combined performance improvement.
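To make the pipelining point concrete, here is a toy contrast between operator-at-a-time execution and a fused (pipelined) loop. Python stands in purely for illustration; Tupleware itself emits LLVM-compiled code.

```python
def materialized(data, pred, f):
    # Operator-at-a-time: each operator materializes a full intermediate list.
    filtered = [x for x in data if pred(x)]
    mapped = [f(x) for x in filtered]
    return sum(mapped)

def pipelined(data, pred, f):
    # Fused loop: each tuple flows through filter, map, and aggregate
    # with no intermediate buffers -- the shape a compiled workflow takes.
    total = 0
    for x in data:
        if pred(x):
            total += f(x)
    return total
```

Both compute the same answer, but the fused form touches each tuple once and allocates nothing in between, which is where much of the single-node speedup comes from.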

Tupleware generates and compiles a program for each workflow; leaning on the LLVM compiler framework lets it apply techniques like inlining UDFs, SIMD vectorization, and generated implementations for context variables. In the distributed setting, Tupleware communicates at the level of blocks of bytes, while Spark uses fairly expensive Java serialization that consumes a sizable fraction of the runtime for distributed execution. The paper does not describe any support for fault-tolerance, but given the huge performance improvement over Spark for many applications, simply restarting a failed job may be reasonable up to a certain job size.

Many researchers have pointed out that the fault-tolerance trade-offs that are right for analyzing large datasets (hundreds of terabytes) are not the same as those for smaller datasets (a few terabytes). Glad to see data management research that highlights this, going well beyond the usual "X is faster than Hadoop for task T" claim.