Genome sequencing is still one of the major problems in bioinformatics where one has to deal with large amounts of data. As advances in technology make it progressively cheaper to sequence an entire genome, scientists are coming up with many interesting questions they can tackle with sequencing.
I had assumed that the computational aspects of sequencing (or more specifically, re-sequencing) were a mostly solved problem that didn't really need much attention from computer scientists anymore. (I haven't spent much time or energy in this space since my grad school days.) Turns out there have been some very interesting developments in this space in the last three years.
Here's the basic problem definition: you get a tissue sample from someone and stick it in a machine. The machine spits out many "reads" that are short strings over the alphabet {A,C,G,T} -- each of these reads is the sequence for some portion of the DNA in the sample. These are noisy measurements, so there may be occasional errors. Over a few hours, the machine spits out many millions of reads. Now, the computational task is to build up the best guess for the whole sequence from which these reads arose given a reference sequence that is "pretty close".
The general idea that many algorithms follow is to align these reads to the reference sequence while tolerating a few mismatches. Then, for each position in the reference sequence, they look up the consensus of all the aligned reads to figure out the base (A, C, G, T) at that position. There are many variants of this approach: some use indexes on the reference sequence, some use indexes on the reads, some use heuristic-based approximate alignments, and some use relatively expensive Smith-Waterman-style alignments. Sometimes there are multiple candidate reference sequences, and the reads themselves can come with additional constraints (paired-end reads).
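To make the align-then-take-consensus idea concrete, here is a deliberately naive sketch in Python. It is not how any of the real aligners work: it brute-forces every offset using Hamming distance (no index, no gaps, no paired-end constraints), and the function names, the `max_mismatches` parameter, and the toy sequences are all made up for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=1):
    """Return the offset with the fewest mismatches, or None if every
    offset needs more than max_mismatches. Brute force: O(len(ref) * len(read))."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(read, reference[pos:pos + len(read)])
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def call_consensus(reads, reference, max_mismatches=1):
    """Pile up aligned reads and take a majority vote at each position.
    Positions not covered by any read fall back to the reference base."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        pos = align_read(read, reference, max_mismatches)
        if pos is None:
            continue  # unalignable read: drop it
        for i, base in enumerate(read):
            pileup[pos + i][base] += 1
    return "".join(
        counts.most_common(1)[0][0] if counts else ref_base
        for counts, ref_base in zip(pileup, reference)
    )


# Toy example: the sample differs from the reference at index 5 (G -> A).
reference = "ACGTTGCAAGCT"
reads = ["ACGTTA", "GTTACA", "TACAAG", "CAAGCT"]
consensus = call_consensus(reads, reference)
print(consensus)  # ACGTTACAAGCT
```

Three of the four reads cover index 5 and all of them say A, so the consensus picks up the variant even though each read was aligned with one mismatch against the reference. Real tools replace the brute-force scan with an index and handle insertions, deletions, and base quality scores, none of which this sketch attempts.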
Here's a list of recent papers/projects dealing with the basic version of this problem:
SNAP is the newest one out of Berkeley, MSR, and UCSF and promises to be more than 10x faster than the competition. Apart from the ones listed above, there are many proprietary algorithms that come with the sequencing machine you buy. And of course, there's a host of older algorithms that the papers above built on.
There are some interesting data management aspects to this problem beyond just sequencing: different people care to different degrees about the actual alignment method used and how much data is retained. Often, the interesting analyses come after the alignment and SNP identification, and may only have to go back as far as the aligned sequence. Some scientists may want to go back all the way to the reads once they have completed some downstream analysis and want to verify the significance of their findings. If you're a big hospital sequencing thousands of patients, managing the reads, the aligned sequences, and the results of downstream experiments could easily be a multi-petabyte problem.
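For a rough sense of where a number like that comes from, here is a back-of-the-envelope calculation. Every input is an assumption picked for illustration, not a figure from any of the papers: a genome of about 3 billion bases, 30x coverage, 2 bytes per sequenced base (one for the base, one for its quality score, uncompressed), and 5,000 patients.

```python
def raw_read_bytes(genome_bases, coverage, bytes_per_base):
    """Uncompressed size of the raw reads for one genome."""
    return genome_bases * coverage * bytes_per_base


# All four inputs below are illustrative assumptions.
GENOME_BASES = 3_000_000_000   # roughly human-sized genome
COVERAGE = 30                  # each base sequenced ~30 times on average
BYTES_PER_BASE = 2             # base call + quality score, no compression
PATIENTS = 5_000

per_patient = raw_read_bytes(GENOME_BASES, COVERAGE, BYTES_PER_BASE)
total = per_patient * PATIENTS

print(per_patient / 1e9, "GB per patient")   # 180.0 GB per patient
print(total / 1e15, "PB for raw reads")      # 0.9 PB for raw reads
```

Under these assumptions the raw reads alone approach a petabyte before counting the aligned sequences or any downstream results. Compression would shrink this considerably, so treat it as an order-of-magnitude estimate only.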