Thursday, July 6, 2017

KDD 2017 Teaser Video and Paper

Check out the teaser video for our paper at KDD 2017!  The PDF of the paper is available here at the Google Research web site.


Tuesday, May 30, 2017

Drive Quick Access at KDD 2017

Our paper, "Quick Access: Building a Smart Experience for Google Drive" was accepted for presentation at KDD 2017. The development of the Quick Access feature illustrates many general challenges and constraints associated with practical machine learning such as protecting user privacy, working with data services that are not designed with machine-learning in mind, and evolving product definitions. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine-learning problems. We're looking forward to sharing this with the KDD community.

Friday, March 10, 2017

Quick Access on Drive: How Machine Learning can save you time

Check out my blog post on the Google Research blog about how ML saves you time with Quick Access on Drive.
Quick Access on Drive (Web and iOS)

Thursday, September 29, 2016

Quick Access: Machine Learning in Google Drive

I'm very excited about a new feature in Drive that our team in Research/Machine Intelligence helped launch today: Quick Access for Drive.



Quick Access uses Google machine learning on your Drive activity and your daily interactions to bring you the documents you need before you search for or navigate to them. Our user research shows that it shaves 50 percent off the average time it takes to get to the right file. Quick Access is available for G Suite customers on Android. If you're interested in all the new intelligent features in Google Apps, check out this post from Prabhakar Raghavan, the VP for Google Apps.

Thursday, August 20, 2015

UDF Centric Workflows

A recent PVLDB paper titled "An Architecture for Compiling UDF-Centric Workflows" describes some pretty exciting advances in the state of the art for data analysis. Using a system called Tupleware, the authors of the paper show that for datasets that fit into the memory of a small cluster, it is possible to extract far more performance from the hardware than is commonly provided on platforms like Spark. For instance, on a K-Means task, Tupleware was faster than Spark by 5X on a single node, and by nearly 100X on a 10-node cluster. For small datasets, that probably means a 1000x (3 orders of magnitude) improvement over Hadoop.

The dramatic performance improvements come from leveraging several known techniques from the literature on main-memory database systems and compilers. Tupleware combines high-level optimizations common in relational optimizers (projection and selection pushdown, join order optimization) with optimizations considered by recent main-memory systems, such as choosing between pipelined operators and vectorized simple operators. The paper describes several techniques and heuristics that together produce this performance improvement.
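To make the pipelining vs. vectorized distinction concrete, here is a minimal sketch in plain Python. It is not Tupleware's generated code (the paper targets LLVM); the function names and the toy map/filter/sum workflow are my own assumptions, chosen only to show the two execution shapes.

```python
from typing import Callable, Iterable


def operator_at_a_time(rows: Iterable[int],
                       udf_map: Callable[[int], int],
                       udf_filter: Callable[[int], bool]) -> int:
    """Run each operator as its own simple loop over the whole input.

    Each loop is easy to vectorize, but every step materializes an
    intermediate result in memory.
    """
    mapped = [udf_map(r) for r in rows]          # intermediate buffer 1
    kept = [m for m in mapped if udf_filter(m)]  # intermediate buffer 2
    return sum(kept)


def pipelined(rows: Iterable[int],
              udf_map: Callable[[int], int],
              udf_filter: Callable[[int], bool]) -> int:
    """Fuse all operators into one loop that handles a tuple at a time.

    No intermediate buffers are created, so each value stays "hot",
    but the loop body is more complex and harder to vectorize.
    """
    total = 0
    for r in rows:
        m = udf_map(r)
        if udf_filter(m):
            total += m
    return total


if __name__ == "__main__":
    data = range(10)
    square = lambda x: x * x
    is_even = lambda x: x % 2 == 0
    print(operator_at_a_time(data, square, is_even))
    print(pipelined(data, square, is_even))
```

Both functions compute the same answer; the difference is only in how the work is scheduled. Which shape wins depends on the workload, which is why a heuristic for choosing between them is worth having.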

Tupleware generates and compiles a program for each workflow, and leaning on the LLVM compiler framework lets it use many techniques like inlining UDFs, SIMD vectorization, generated implementations for context variables, etc. In the distributed setting, Tupleware communicates at the level of blocks of bytes, while Spark uses fairly expensive Java serialization that consumes a fair fraction of the runtime for distributed execution. The paper does not describe any support for fault tolerance, but with the huge performance improvement over Spark for many applications, simply restarting the job may be reasonable until we get to a certain job size.
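A rough back-of-the-envelope model shows why restarting can be fine for short jobs. This is my own sketch, not something from the paper: it assumes failures arrive as a Poisson process with a given mean time between failures (MTBF), that a failed job restarts from scratch immediately, and that there is no checkpointing. Under those assumptions the expected time to finish a job of length T is (e^(T/MTBF) - 1) * MTBF. The MTBF and job lengths below are hypothetical numbers.

```python
import math


def expected_runtime_with_restarts(job_seconds: float,
                                   mtbf_seconds: float) -> float:
    """Expected wall-clock time to finish a job that restarts from
    scratch on every failure.

    Assumes exponentially distributed time between failures (rate =
    1 / MTBF) and zero restart overhead. Both are simplifications.
    """
    rate = 1.0 / mtbf_seconds
    # expm1 keeps precision when rate * job_seconds is tiny.
    return math.expm1(rate * job_seconds) / rate


if __name__ == "__main__":
    one_day = 24 * 60 * 60  # hypothetical cluster-wide MTBF
    for job in (60, 3600, one_day):
        expected = expected_runtime_with_restarts(job, one_day)
        print(f"job={job:>6}s  expected={expected:10.1f}s  "
              f"overhead={expected / job:.3f}x")
```

With these assumptions the overhead is negligible when the job is much shorter than the MTBF and grows exponentially once the job length approaches it, which is one way to think about the "certain job size" where fault tolerance starts to pay for itself.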

Many researchers have pointed out that the fault-tolerance trade-offs that are right for analysis of large datasets (hundreds of terabytes) are not the same as the ones for smaller datasets (a few terabytes). I'm glad to see data management research that highlights this and goes well beyond "X is faster than Hadoop for T".


Wednesday, August 19, 2015

Insurance and Self-Driving Cars

Metromile (a car insurance company that's pioneering a pay-as-you-go model) recently produced an analysis of what their monthly insurance rates would be for a self-driving car based on the accident stats that Google released. Check out the expected savings on a variety of models:


No surprise that insurance companies like self-driving cars -- for the kinds of applications we've seen these cars used for, they are much safer than the typical (distracted) American driver. This is one of those world-changing technologies I'm eagerly looking forward to!

Wednesday, May 13, 2015

Google Cloud Bigtable: NoSQLaaS

I'm very excited that Google announced Cloud Bigtable. You get a managed instance of Bigtable with an open-source API (there's an HBase client) at a great price! Bigtable's HBase APIs are fairly simple and easy to use, and we seem to have mapped nearly all of the features in the client to Bigtable.

Combined with Google Cloud Dataflow, which lets you write Flume jobs over your data for analytics, I think Google's Cloud Platform is very compelling for data-intensive applications. I'm a huge fan of Flume, and am absolutely convinced it provides a huge boost to developer productivity compared to writing MapReduces directly. I think the platform finally has all the pieces needed to build big data applications far more quickly than rolling your own Hadoop+MapReduce+HBase+Hive/Pig/Scalding stack in the cloud. I'm excited to see how developers will use this platform.