I was recently at the AMP Lab retreat. There’s a lot of interesting work going on at Berkeley. Here are a few things that caught my attention:
- Datacenter OS
Lots of solid systems work under the general umbrella of figuring out how to do better resource management: looking at workload traces and using them to drive improvements to scheduling and caching. Some of my favorites were:
· DRF (Dominant Resource Fairness): A scheduling algorithm that guarantees fair allocation across multiple resource types and is strategy-proof. It is an interesting and perhaps useful alternative when optimizing utilization is not always the right thing to do. There was a nice presentation from Ali Ghodsi that described why this was interesting and the value proposition of DRF when compared to a market-based allocation technique.
· Orchestra and Memento: Orchestra is a bunch of techniques for scheduling large data transfers like broadcasts and shuffles to minimize job completion times. Memento is a globally coordinated caching strategy for Hadoop-like deployments to help speed up completion times for small jobs.
· Performance Isolation: Doing good performance isolation beyond “number of cores” and amount of memory without substantial overhead has been tricky. There are a bunch of projects under this umbrella looking at better ways to do this for other resources such as disk I/O, power, and even memory bandwidth! (A project called RAMP?) Performance isolation at the level of memory bandwidth is probably going to open up a whole new set of application frameworks that can be supported at the datacenter level.
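To make the DRF idea from the list above a bit more concrete, here is a minimal sketch in plain Python. It is a toy, not the code presented at the retreat: the function name `drf_allocate`, the dictionary layout, and the example numbers are all assumptions made for illustration. The core rule is that each user's "dominant share" is the largest fraction of any single resource they hold, and the next task always goes to the user whose dominant share is currently lowest. One simplification to note: when a user's next task no longer fits, this sketch drops that user and keeps going with the others.

```python
def drf_allocate(capacity, demands):
    """Toy progressive-filling DRF.

    capacity: {resource: total amount}            (hypothetical layout)
    demands:  {user: {resource: amount per task}} (hypothetical layout)
    Returns the number of tasks launched per user.
    """
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}
    dominant = {u: 0.0 for u in demands}
    active = set(demands)

    while active:
        # Pick the active user with the smallest dominant share
        # (ties broken by name so the result is deterministic).
        user = min(sorted(active), key=lambda u: dominant[u])
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
            # Dominant share = largest fraction of any one resource held.
            dominant[user] = max(
                tasks[user] * need[r] / capacity[r] for r in capacity
            )
        else:
            # This user's next task does not fit; stop considering them.
            active.discard(user)
    return tasks


if __name__ == "__main__":
    cluster = {"cpu": 9, "mem": 18}
    users = {
        "A": {"cpu": 1, "mem": 4},  # memory-heavy tasks
        "B": {"cpu": 3, "mem": 1},  # CPU-heavy tasks
    }
    print(drf_allocate(cluster, users))  # {'A': 3, 'B': 2}
```

In this example both users end up with a dominant share of 2/3 (A holds 12 of 18 GB, B holds 6 of 9 CPUs), which is the kind of equalization DRF aims for.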
Spark is a deceptively simple project, and perhaps one of my favorites, that I think has just scratched the surface of what’s possible. Scala provides an interesting playground that could help bridge the gap between programming languages and query languages. Spark touches on some simple techniques that can be used in this space and provides a runtime that applies them while staying Java-friendly (it stores deserialized data in memory). Bagel is a neat implementation of the BSP/Pregel APIs using the pieces in Spark. There’s also some work on running SQL on RDDs for interactive queries: a quick-and-dirty way to get a toy version of Google’s Dremel. I think there are many interesting things to come in this space.
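For readers who haven’t seen the BSP/Pregel model that Bagel implements, here is a rough single-machine sketch in plain Python. It does not use Bagel’s or Spark’s actual API; `pregel_max` and its arguments are hypothetical names chosen for illustration. The idea is that computation proceeds in supersteps: in each one, every vertex reads the messages sent to it in the previous superstep, possibly updates its value, and sends messages to its neighbors. The run ends when no messages are in flight. The example propagates the maximum value through a small graph.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Toy BSP loop: every vertex converges to the largest value it can hear about.

    values: {vertex: number}
    edges:  {vertex: [neighbor, ...]}  (messages flow along these edges)
    """
    values = dict(values)

    # Superstep 0: every vertex announces its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)

    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no messages in flight: every vertex has halted
        outbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            # A vertex only wakes up and re-broadcasts if it learned a larger value.
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox  # barrier: messages become visible in the next superstep
    return values


if __name__ == "__main__":
    vals = {"a": 3, "b": 6, "c": 2, "d": 1}
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(pregel_max(vals, chain))  # {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

In a real system the per-vertex work inside each superstep would be spread across machines, with the barrier between supersteps being the only global synchronization point.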
- New Application: Cancer Genomics
There were two very exciting presentations on the possibility of “big data” infrastructure helping find a cure for cancer! The first one was from David Haussler of UC Santa Cruz, who talked about the data challenges in understanding cancer genomics. The second was from Taylor Sittler from UCSF. The interesting take-away was that sequencing and SNP calling are not completely solved “easy” problems: there are interesting genetic variations, such as insertions, deletions, and repeats, that are still computationally expensive to discover from short-read sequencers. There’s plenty of computational work to do in understanding which changes are significant and why. Sittler’s talk was also really fascinating; he identified two interesting applications:
· Automatically recommending a drug cocktail based on sequence information gathered from tumor and normal tissue. The analytics flow is assumed to have access to a drug target database which has information on which drugs affect which targets/pathways. A medical expert can use the resulting list as a starting point for designing the treatment.
· Determining novel viruses: Looking at sequence data from sick individuals and classifying the reads into “human”, “known bacteria/viruses”, and “novel viruses”. This seemed somewhat easier than the previous problem, but perhaps I was missing something.
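As a back-of-the-envelope illustration of the read-classification problem above, here is a toy sketch in Python. Everything in it is an assumption for illustration: the function names, the k-mer overlap rule, the thresholds, and the made-up sequences. It is not the method described in the talk, and real pipelines rely on proper aligners and much larger reference databases. The sketch labels a read as “human” if enough of its k-mers appear in a human reference, as “known bacteria/viruses” if it matches a known reference, and otherwise flags it as a candidate novel sequence.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, human_ref, known_refs, k=4, min_frac=0.5):
    """Toy classifier based on the fraction of shared k-mers (hypothetical rule)."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return "unclassified"  # read is shorter than k

    def frac(ref):
        # Fraction of the read's k-mers that also occur in the reference.
        return len(read_kmers & kmers(ref, k)) / len(read_kmers)

    if frac(human_ref) >= min_frac:
        return "human"
    if any(frac(ref) >= min_frac for ref in known_refs):
        return "known bacteria/viruses"
    return "candidate novel"


if __name__ == "__main__":
    human = "ACGTACGTTAGCCGATAGGCTA"   # made-up stand-in for a human reference
    known = ["TTTTGGGGCCCCAAAATTGG"]   # made-up stand-in for known pathogens
    for read in ["ACGTTAGCCG", "GGGGCCCCAA", "CATCATCATC"]:
        print(read, "->", classify_read(read, human, known))
```

Even this toy hints at where the difficulty might hide: sequencing errors, contamination, and sequences that only partially resemble a known reference all blur the line between “known” and “novel”.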