The data management community is increasingly interested in helping the larger scientific community organize, query, and process large data sets. I got to work on a really interesting bioinformatics problem this summer that involved a significant data processing and computational aspect. In the abstract, the problem is a learning task -- there are several items in a dataset, and each item has millions of features. Assume that each feature is categorical and can take one of three values. Each item is also associated with a set of "observations". Observations can be either categorical (eg. disease/no disease) or continuous. The objective is to predict the observations for unseen items.

This turns into a difficult model selection problem and various constraints are used from domain knowledge to reduce the space of possible models. For instance, consider the assumption that an observation can be explained using exactly one feature. Each of the features can be tested to see how well they can predict the observation using some learning technique (decision trees, regression, SVMs, etc.). If we find a good predictor, we also need to calculate a P-value. That is, the likelihood that such a good fit could have risen by chance. Furthermore, we may have to correct for the fact that we just evaluated millions of hypotheses (features). Multiple hypothesis testing in the statistical literature deals with assessing significance when many different statistical tests are performed. One of the techniques used ,when there's no easy closed-form technique for correcting for multiple hypotheses is to use permutations -- ie, the null hypothesis is that the observations have nothing to do with the actual features, so you hold the items in place, and randomly permute the observations. Any feature that is significant now is purely by chance. Repeating this computation a million times and finding only 1 good fit can be used to infer that the P-value is around 1 in a million, which is pretty significant.

The data sizes for this problem are small enough (in the tens or hundreds of gigabytes) that it can be solved on a single machine. But the processing demands are large enough that it makes sense to parallelize the solution. A traditional approach to solving this problem would have used an HPC-like setting. I worked with domain scientists to build a solution using several open source components including Hadoop, R, and RHIPE. In the coming weeks, I'll talk about what we learned from this approach -- what worked for the domain scientists, and what didn't. What parts helped increase the scientist's productivity and what hampered it.