Tuesday, October 4, 2011

Tenzing: SQL on MapReduce

Google described Tenzing, their implementation of SQL on MapReduce, at VLDB this year. The paper explains that they built it because their data warehouse was getting too expensive and could not keep pace with the demands users made, both in terms of complex queries (SQL was always getting in the way) and performance (the ETL process was often a big bottleneck). Neither argument is particularly surprising, although there are examples of successful PB-sized warehouses in the industry.

One of the most interesting parts of the paper was a laundry list of things they had to fix in MapReduce to get better performance for SQL:
  • Workerpools: These are essentially long-running processes that perform the various pieces of a MapReduce job (map tasks, reduce tasks, job coordination, etc.). Having a pool of running processes makes latencies lower than they would be if you had to launch a binary for each task in the job. This is certainly an issue with JVM launches in Hadoop; Hadoop gets part of the way there with reusable JVMs. The tradeoff, of course, is that fault isolation becomes a messier proposition.
  • Streaming and In-Memory Chaining: Allows two MapReduce jobs to communicate without spilling intermediate results to disk (GFS). I wonder if this can be done easily with just some InputFormat/OutputFormat magic ... I suspect this is do-able with some thought. Memory chaining allows a mapper and a reducer to be co-located in the same process. This is probably going to be a bit harder to do in Hadoop (a partial in-process analogue is sketched after this list).
  • Sort Avoidance: This feature allowed you to tell MapReduce to shuffle, but not sort. I've seen the need for this in many applications. Again, makes perfect sense for Hadoop also.
  • Block Shuffle: For smaller rows, when sorting is not needed, the block shuffle reduces overheads in the shuffle phase. This is a performance opportunity opened up by sort avoidance.
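As an aside on the memory-chaining point: Hadoop's classic mapred API already lets you run several map stages inside the same task process via ChainMapper/ChainReducer, which gets part of the way there (it does not cover streaming between separate jobs). Here's a minimal sketch against the old 0.20-era API; the word-count pipeline itself is just a made-up example.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

// Sketch: chaining two map stages and a reducer inside one job.
// Records flow between the chained stages in memory within the same task
// JVM rather than through intermediate HDFS files -- a rough analogue of
// Tenzing's memory chaining, limited to stages of a single job.
public class ChainedWordCount {

  // Stage 1: split each line into (word, 1) pairs.
  public static class Tokenize extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text line,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        out.collect(new Text(tok.nextToken()), new IntWritable(1));
      }
    }
  }

  // Stage 2: lowercase the word; runs in the same JVM as stage 1 and
  // consumes its output directly, with nothing written to HDFS in between.
  public static class Lowercase extends MapReduceBase
      implements Mapper<Text, IntWritable, Text, IntWritable> {
    public void map(Text word, IntWritable one,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      out.collect(new Text(word.toString().toLowerCase()), one);
    }
  }

  // Reduce: sum the counts per word.
  public static class Sum extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      int total = 0;
      while (counts.hasNext()) {
        total += counts.next().get();
      }
      out.collect(word, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainedWordCount.class);
    job.setJobName("chained-wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Two map stages chained in one process, then the reducer.
    ChainMapper.addMapper(job, Tokenize.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, true, new JobConf(false));
    ChainMapper.addMapper(job, Lowercase.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, new JobConf(false));
    ChainReducer.setReducer(job, Sum.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, true, new JobConf(false));

    JobClient.runJob(job);
  }
}
```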
The other interesting bit was the section on their execution engine. Their SQL-Sawzall implementation was slow because of all the deserialization and serialization costs associated with moving data in and out of Sawzall's type system. The second version used Dremel's SQL expression evaluation engine -- this was better than the Sawzall engine, but still had inefficiencies. The most promising version was the LLVM-based experimental engine that could operate natively on columnar data, and this, not surprisingly, provided the best performance.
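To get a feel for why that matters, here is a toy contrast (purely illustrative, not Tenzing's code) between a row-at-a-time evaluator that has to deserialize each field before evaluating an expression, and a loop over primitive column arrays of the kind an LLVM-based engine can compile into tight machine code.

```java
// Illustrative only -- not Tenzing's engine. Contrasts a row-at-a-time,
// deserializing evaluator with a tight loop over primitive column arrays
// for the query: SELECT SUM(price * qty) WHERE qty > 10.
import java.nio.ByteBuffer;
import java.util.List;

public class RowVsColumn {

  // Row-at-a-time: every row is decoded field by field before the expression
  // is evaluated, so CPU time goes into (de)serialization and per-row object
  // handling rather than arithmetic.
  static double sumRowAtATime(List<byte[]> serializedRows) {
    double sum = 0;
    for (byte[] row : serializedRows) {
      ByteBuffer buf = ByteBuffer.wrap(row);
      double price = buf.getDouble();   // decode field 1
      int qty = buf.getInt();           // decode field 2
      if (qty > 10) {
        sum += price * qty;
      }
    }
    return sum;
  }

  // Columnar: the data is already laid out as primitive arrays, so the
  // evaluator is a branch-light loop the compiler can optimize aggressively --
  // the flavor of code a code-generating engine would emit for this query.
  static double sumColumnar(double[] price, int[] qty) {
    double sum = 0;
    for (int i = 0; i < price.length; i++) {
      if (qty[i] > 10) {
        sum += price[i] * qty[i];
      }
    }
    return sum;
  }
}
```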

What does this mean for the Apache Hadoop ecosystem? I'm guessing the biggest performance opportunities for Hive currently lie in making the execution engine more CPU efficient. In fact, an experimental branch of Hive that does this would probably be a really fun open-source project right now. Hadoop will likely incorporate some of the performance improvements described in Tenzing over time, and Hive should be able to ride those when they become available.

Tuesday, September 20, 2011

Parallel Stochastic Gradient Descent

I've been learning about stochastic gradient descent (SGD) and why people seem to be excited about it in the context of building models over large datasets. Rainer Gemulla, Peter Haas, Erik Nijkamp, and Yannis Sismanis published a really cool paper at KDD this year that was my introduction to the problem.

At a high level, SGD is simply an optimization technique that can be used to minimize some objective function (such as a loss function). The main difference from traditional "batch" gradient descent is that SGD samples from the data at each step. That is, the model can be updated after evaluating the gradient of the loss function for a sample of the input data -- a sample that can be as small as a single point! SGD can achieve incredibly fast performance on a variety of machine learning tasks, and often converges to a good solution much faster than conventional batch gradient descent. On large data sets, the performance difference can be an order of magnitude or more. There are, of course, lots of caveats about data size, learning rates, sampling techniques, and convergence rates when it comes to SGD.
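As a toy illustration of what "update after a single sample" means, here is single-sample SGD for least-squares linear regression; the synthetic data, fixed learning rate, and step count are all made up for the example (real uses typically add decaying learning rates and a convergence check).

```java
import java.util.Arrays;
import java.util.Random;

// Toy single-sample SGD for least-squares linear regression: after looking
// at just one (x, y) pair, nudge the weights against the gradient of that
// one point's squared error -- no full pass over the data per update.
public class ToySgd {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    int n = 10000, d = 5;
    double[][] x = new double[n][d];
    double[] y = new double[n];
    double[] trueW = {1.0, -2.0, 0.5, 3.0, -1.5};
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < d; j++) x[i][j] = rnd.nextGaussian();
      for (int j = 0; j < d; j++) y[i] += trueW[j] * x[i][j];
      y[i] += 0.01 * rnd.nextGaussian();            // a little noise
    }

    double[] w = new double[d];                     // model being learned
    double lr = 0.01;                               // fixed learning rate
    for (int step = 0; step < 100000; step++) {
      int i = rnd.nextInt(n);                       // sample ONE point
      double pred = 0;
      for (int j = 0; j < d; j++) pred += w[j] * x[i][j];
      double err = pred - y[i];
      for (int j = 0; j < d; j++) w[j] -= lr * err * x[i][j];  // gradient step
    }
    System.out.println(Arrays.toString(w));         // should be close to trueW
  }
}
```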

The KDD paper focuses on matrix factorization -- a technique that has received a lot of interest in the context of the "Netflix Problem". Given a sparse C x I matrix M of C customers and I items (say, movies), where each known entry is a rating, the problem is to find factors A and B such that the matrix M' = A x B matches the existing values in M and offers predictions for the missing values. There are many techniques for solving this problem. However, as the matrix M gets larger, and you have billions of ratings from millions of customers, the problem becomes too big to solve on a single machine.
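For concreteness, the usual regularized squared-loss formulation of this problem, and the per-rating SGD update it leads to, look roughly like this (the KDD paper actually handles a broader family of loss functions, so treat this as one representative instance; Z is the set of observed entries, A is C x k, B is k x I, \eta is the learning rate, \lambda the regularization weight):

```latex
% Loss over the observed entries Z
L(A, B) = \sum_{(c,i) \in Z} \left( M_{ci} - A_{c*} B_{*i} \right)^2
          + \lambda \left( \lVert A \rVert_F^2 + \lVert B \rVert_F^2 \right)

% SGD step for a single observed rating M_{ci}
\varepsilon = M_{ci} - A_{c*} B_{*i}
A_{c*} \leftarrow A_{c*} + \eta \left( \varepsilon\, B_{*i}^{\top} - \lambda\, A_{c*} \right)
B_{*i} \leftarrow B_{*i} + \eta \left( \varepsilon\, A_{c*}^{\top} - \lambda\, B_{*i} \right)
```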

Rainer's KDD paper shows how you can use a distributed version of SGD to solve this problem using a cluster of machines. The central idea in the paper is a clever partitioning of the matrix so that each node in the cluster can work on a partition of the data and update the model (the factors A and B) without fine-grained coordination with the other nodes in the cluster. The bulk of the paper deals with proving why their partitioning strategy is statistically sound. The paper describes an implementation of this using R and Snowfall for small clusters, as well as an implementation of the algorithm on Hadoop. The details of the Hadoop implementation are tricky: you need to achieve a particular partitioning of the data, some of the communication patterns on the model are not a natural fit for Hadoop (resulting in some inelegant MapReduce code), and a certain amount of additional coordination is required for efficient performance. However, the results are impressive -- the distributed SGD algorithm converges to a good solution faster and scales much better than alternative techniques.
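To make the partitioning trick concrete, here is a toy, single-machine simulation of the idea as I understand it: cut M into d x d blocks and, within each sub-epoch, have worker b process block (b, (b + s) mod d), so that concurrently processed blocks share no rows of A and no columns of B and therefore need no fine-grained locking. This is just a sketch with made-up sizes, using threads in place of cluster nodes, not the paper's Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy simulation of the stratified/blocked SGD idea on one machine:
// in sub-epoch s, worker b updates block (b, (b + s) mod D). Blocks that
// run at the same time touch disjoint rows of A and columns of B.
public class DsgdSketch {
  static final int C = 1000, I = 800, K = 10, D = 4;   // sizes, rank, #workers
  static final double LR = 0.01, LAMBDA = 0.05;

  static class Rating {
    final int c, i; final double value;
    Rating(int c, int i, double value) { this.c = c; this.i = i; this.value = value; }
  }

  public static void main(String[] args) throws Exception {
    Random rnd = new Random(7);
    double[][] A = randomMatrix(C, K, rnd);   // customer factors (C x K)
    double[][] B = randomMatrix(I, K, rnd);   // item factors, stored as I x K

    // Bucket the observed ratings by (row block, column block).
    List<List<List<Rating>>> blocks = new ArrayList<>();
    for (int rb = 0; rb < D; rb++) {
      List<List<Rating>> row = new ArrayList<>();
      for (int cb = 0; cb < D; cb++) row.add(new ArrayList<Rating>());
      blocks.add(row);
    }
    for (int n = 0; n < 50000; n++) {
      int c = rnd.nextInt(C), i = rnd.nextInt(I);
      blocks.get(c * D / C).get(i * D / I).add(new Rating(c, i, rnd.nextDouble() * 5));
    }

    ExecutorService pool = Executors.newFixedThreadPool(D);
    for (int epoch = 0; epoch < 10; epoch++) {
      for (int s = 0; s < D; s++) {                  // D sub-epochs per epoch
        List<Future<?>> tasks = new ArrayList<>();
        for (int b = 0; b < D; b++) {
          final int rowBlock = b, colBlock = (b + s) % D;
          tasks.add(pool.submit(() -> {
            for (Rating r : blocks.get(rowBlock).get(colBlock)) {
              double err = r.value - dot(A[r.c], B[r.i]);
              for (int k = 0; k < K; k++) {          // local SGD update
                double a = A[r.c][k], bv = B[r.i][k];
                A[r.c][k] += LR * (err * bv - LAMBDA * a);
                B[r.i][k] += LR * (err * a - LAMBDA * bv);
              }
            }
          }));
        }
        for (Future<?> t : tasks) t.get();           // barrier between sub-epochs
      }
    }
    pool.shutdown();
    System.out.println("Training finished; A and B now hold the factors.");
  }

  static double dot(double[] u, double[] v) {
    double s = 0;
    for (int k = 0; k < u.length; k++) s += u[k] * v[k];
    return s;
  }

  static double[][] randomMatrix(int rows, int cols, Random rnd) {
    double[][] m = new double[rows][cols];
    for (double[] rowArr : m)
      for (int k = 0; k < cols; k++) rowArr[k] = 0.1 * rnd.nextGaussian();
    return m;
  }
}
```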

As one can imagine, Netflix isn't the only company that needs to solve large recommendation problems. There are many large content publishers, marketplaces, and retailers working hard on good solutions to this problem. What's more, the ability to factorize large matrices is good not just for content recommendation. It can be used as a building block for clustering algorithms, topic detection, and even risk analytics. Neat stuff. This should enable some interesting new recommendation-style applications and provide alternate/faster implementations for existing ones.

Thursday, September 15, 2011

New Ideas In Datacenter Networking

I've recently been stumbling into interesting papers on new ways to wire up datacenters to guarantee properties like high bisection bandwidth without the need for expensive networking hardware like big second level switches. I don't normally read a lot about networking but some of the stuff I've been working on lately has given me a peek into the SIGCOMM, NSDI world ...

CamCube, BCube, and DCell all seem to be exploring ways of giving up the big, expensive switch at the top of a tree-structured network in favor of multiple NICs on each server and/or multiple smaller switches connected in interesting topologies. I suspect this is still very early-stage work, and many problems (cabling, serviceability, availability, NIC-to-NIC transfers without burning CPU, etc.) need to be solved before it becomes interesting to a practitioner. But the papers do offer an intriguing read for someone who normally follows the database and systems communities.

In the Hadoop/MapReduce context, this could change how we think about scheduling tasks and transfers when multiple MapReduce jobs are running. One of the papers actually talks about how Hadoop workloads could be affected by these new topologies. One criticism of these approaches in the context of Hadoop clusters is that they probably only get interesting for mid-to-large clusters. Monash reports that the median Hadoop cluster is about 30 nodes and the average is about 200 nodes (numbers courtesy of Omer Trajman of Cloudera). At those sizes, there are cheaper and easier ways to wire up a Hadoop cluster. Fun papers nevertheless!

Thursday, September 8, 2011

Spinnaker VLDB Slides

[edit] Here's the full presentation I used at VLDB:



Here's a subset of the slides I used at VLDB comparing Spinnaker with Bigtable, Dynamo, and PNUTS:






Sunday, September 4, 2011

VLDB 2011

I was at VLDB last week. I got to see a bunch of interesting talks and had several fun hallway conversations.

Here are some notes from the conference:

Challenges and Vision Track

The Challenges and Vision track was probably the most fun. Two talks stood out for me (although I didn't attend all the talks in this session): Magdalena Balazinska's talk on data markets, and Peter Haas' "Data is Dead -- Without What If Models". Magda's argument was that public clouds are turning out to be a very convenient place to buy and sell data, and that research is required to understand how data and data services can be priced and sold. She argued that the current tiered subscription models don't really cut it (I'm not sure I fully buy that, but it is an interesting proposition). She also argued that this brings about many challenges at the intersection of databases and economics.

Peter Haas (who also works at IBM Almaden) argued that the database community has traditionally been good at descriptive statistics and shallow models, but will need to get good at supporting deep, predictive models to really unlock the value of all the data we're gathering. He gave examples from the healthcare world and the weather simulation world on how deeper models help us ask harder "what-if" questions. In effect he was challenging the data management community to reinvent itself as the "data and model management" community.

Traditional Database Research

Thomas Neumann of MPI gave a fun talk about compiled query processing for modern architectures. He argued that giving compilers a fighting chance to optimize your query plan after generating it in C++ (or LLVM) can be really good for query performance. In fact, with modern architectures, this might be required to get much better performance than the iterator-pipelining model used in many commercial runtimes. Much of this builds on the observations the MonetDB guys have made about columnar processing techniques and the overheads of row-oriented, operator-pipelining runtimes. I expect this will be a good paper to read.
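Roughly, the contrast is between the classic iterator (Volcano) interface, where each tuple is pulled through a chain of virtual next() calls, and emitting one fused loop for the whole pipeline. A toy sketch of the two shapes (not Neumann's actual system):

```java
// Toy contrast between Volcano-style operator pipelining and the kind of
// fused loop a query compiler would emit for:
//   SELECT SUM(a) FROM t WHERE a > 100
public class IteratorVsCompiled {

  // Volcano style: each operator pulls tuples from its child through a
  // virtual next() call, so every tuple pays for several indirect calls
  // and per-tuple boxing.
  interface Operator { Integer next(); }            // null signals end-of-stream

  static class Scan implements Operator {
    private final int[] column; private int pos = 0;
    Scan(int[] column) { this.column = column; }
    public Integer next() { return pos < column.length ? column[pos++] : null; }
  }

  static class Filter implements Operator {
    private final Operator child;
    Filter(Operator child) { this.child = child; }
    public Integer next() {
      Integer v;
      while ((v = child.next()) != null) if (v > 100) return v;
      return null;
    }
  }

  static long sumIterator(int[] column) {
    Operator plan = new Filter(new Scan(column));
    long sum = 0; Integer v;
    while ((v = plan.next()) != null) sum += v;     // one call chain per tuple
    return sum;
  }

  // "Compiled" style: the whole pipeline becomes one tight loop over the
  // data -- the shape of code a generator targeting C++ or LLVM would emit.
  static long sumCompiled(int[] column) {
    long sum = 0;
    for (int v : column) if (v > 100) sum += v;
    return sum;
  }
}
```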

RemusDB, the best paper award winner, was an interesting talk that showed how VM replication now performs well enough to be a way to provide high availability for any arbitrary database implementation. While this approach makes for a quick-and-dirty HA solution that handles the primary failing, I'm not sure a VM-level solution would work when the failed primary restarts and wants to rejoin the master-slave pair. Nevertheless, this will probably make a fun read for anyone interested in database systems.

HyPer (a demo from some of the guys at TU Munich), HYRISE (from HPI), and several others in academia have been playing with ideas for main-memory systems that can provide good OLTP performance while also supporting a reasonable OLAP workload. Their argument is that in most cases you don't really need two separate systems, one for OLTP and one for OLAP. I guess it is back to "one size fits all" now, eh?

There was also a cool paper on serializable snapshot isolation for replicated databases optimized for high update rates, from folks at the University of Sydney and Seoul National University.

Industrial Talks

Google described Tenzing, a SQL engine on MapReduce. The interesting part was that they worked with the MapReduce team quite a bit to make it possible to implement reasonably efficient join plans. The speaker flashed some slides that seemed to indicate that the per-node performance of Tenzing was "comparable" to a commercial database.

Ben Reed talked about "Inspector Gadget" -- a debugging tool for Pig that is currently in use at Yahoo! and has been open sourced as part of Pig 0.9.

Other Stuff

There were a bunch of interesting efforts around Hadoop and MapReduce that I'll write about in more detail later. There seemed to be some interest at the intersection of crowdsourcing and databases -- I'm still not sure I understand the opportunities there.

Tuesday, August 16, 2011

More on Disruptions

Another non-technical post. I was thinking through the amazing variety of industries in which web and related technologies have caused disruptive changes. There's a whole bunch of disruptions along the theme of monetizing the "long tail" --

Long Tail Disruptions
Existing Industry | Disruptive Technology | Companies
Media Advertising | Online Ad networks, exchanges | AdSense
Retail | Online Retail/Auctions | Amazon, eBay
Banking | P2P Lending through web-based apps | Lending Club
Hotels, B&B | P2P Space Sharing through web-based apps | AirBnB
Content Publishing | Self Publishing with hosted blogging platforms | Blogger, WordPress, etc.
Discounted High-End Brand Retail | Web, Social Media + Daily Deal Email | Gilt.com, Fab.com, ...
Tax Accounting | Web-based Apps | TurboTax
Financial Advice | Web-based Apps | Mint.com
Broadcast Television | Streaming Video | Netflix, Hulu

And now the IT industry itself is possibly undergoing several disruptive changes:

IT Disruptions
Existing Industry | Disruptive Technology | Companies
Middleware (DBs, AppServers, etc.) | PaaS | Amazon Web Services, Google AppEngine, Force.com, Heroku, SpringSource, CloudFoundry, Rackspace Cloud
Data Warehousing | MapReduce, Hadoop | Yahoo, Hortonworks, Cloudera, IBM, ...
Short Request Processing | NoSQL stores | DataStax, Couchbase, Grid vendors


Some of these disruptions are likely to have only a small effect, while others will probably change the landscape of IT. It sure is an interesting time to be in technology!

Monday, August 8, 2011

HBase and Column Stores

I've been asked this a few times -- how would you compare the column store on Hadoop that we built (CIF) with HBase? HBase offers column families that are stored separately -- so, you should be able to store each column in a separate file if you put it in its own column family. This should work, and you should be able to get some amount of I/O elimination when scanning a small subset of columns from a much larger table. However, HBase introduces several inefficiencies in comparison to the approach in our paper:

  1. Path length: Reading from HBase instead of directly reading a file (or a set of files) from HDFS means each byte now flows through more code. Data gets read off the disk by the DataNode process in HDFS, then it moves to the DFSClient in the HBase region server, and finally moves to the (key, value) pair returned by the InputFormat that reads using the HBase client object. These additional hops the data has to travel certainly add overhead.
  2. Read logic: The read logic requires checking in at least two places before a row can be returned -- the SSTable on disk, and the state in memory in the Memtable (to use Bigtable terms). This is also additional overhead compared to simply reading the bytes off HDFS.
  3. Schema free: HBase wasn't designed to be used with fixed schemas for a table -- as a result, every row that is stored also stores the names of every column in that row. Knowing the schema allows CIF to store this metadata separately. While in HBase the repetition of the column names can be ameliorated by compressing the SSTables, that still costs CPU cycles spent decompressing the data while reading. Again, more unnecessary overhead.
  4. Load speeds: If you aren't using a special bulk-load API and instead use the standard insert API to load a new batch one row at a time, you'll send all the data over the network twice -- once to the logs, and a second time when the SSTables get replicated. If you're network-bound during load -- and you should be if you're replicating and not doing any crazy processing -- you'll load at half the pace you could reach with CIF.
  5. MapReduce Outputs: This one's a little tricky -- it is not clear how to build an OutputFormat that takes the results of a MapReduce job and inserts them into HBase in a way that is compatible with MapReduce's fault-tolerance assumptions. If the OutputFormat inserts key-values directly into HBase and the task fails halfway through, there's no way to "roll back" the rows already inserted into HBase. The last I checked, you couldn't perform an arbitrary sequence of inserts atomically in HBase. There may be application-specific ways out of this -- but with CIF, you simply end up writing a new set of files, and if the task fails, you delete them as part of the cleanup (a sketch of this file-based commit pattern follows this list). Easy.
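For contrast, here is a rough sketch of the file-based commit pattern CIF can lean on, which is essentially what Hadoop's FileOutputFormat/OutputCommitter machinery does for you: each task attempt writes to its own scratch directory, output becomes visible only via a rename at commit time, and a failed or losing speculative attempt is simply deleted. The paths, file names, and directory layout below are illustrative, not CIF's actual format.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of file-based "commit by rename" output, the pattern
// that makes CIF-style column files easy to roll back: a task attempt writes
// under a private scratch directory, and only a successful commit promotes
// the files to the job's output directory. Names here are made up.
public class FileCommitSketch {
  private final FileSystem fs;
  private final Path finalDir;     // e.g. the table/partition directory
  private final Path attemptDir;   // scratch dir private to this task attempt

  public FileCommitSketch(Configuration conf, Path finalDir, String attemptId)
      throws IOException {
    this.fs = FileSystem.get(conf);
    this.finalDir = finalDir;
    this.attemptDir = new Path(finalDir, "_temporary/" + attemptId);
    fs.mkdirs(attemptDir);
  }

  // The task writes its column files under its private attempt directory,
  // so nothing is visible to readers until commit.
  public FSDataOutputStream createColumnFile(String columnName) throws IOException {
    return fs.create(new Path(attemptDir, columnName + ".col"));
  }

  // On success: promote the attempt's files with cheap metadata renames.
  public void commit() throws IOException {
    for (FileStatus f : fs.listStatus(attemptDir)) {
      fs.rename(f.getPath(), new Path(finalDir, f.getPath().getName()));
    }
    fs.delete(attemptDir, true);
  }

  // On failure (or a losing speculative attempt): nothing was ever visible,
  // so cleanup is just deleting the scratch directory.
  public void abort() throws IOException {
    fs.delete(attemptDir, true);
  }
}
```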
While HBase could be used as a columnar storage layer for scan-heavy applications, it would probably be a lot slower. However, if you need to support point lookups, updates, and deletes, this cost is justifiable. For applications that are append-only and don't need to deal with point updates, CIF provides an alternative with much higher performance.

An easy way to verify this would be to run a simple benchmark and measure the scan performance of CIF and HBase :-) I'll try to do that when I get a chance and post the results.