One of the problems I explored during the last few months at Almaden was secondary indexing in a scale-out NoSQL datastore. A solution I designed and implemented in collaboration with folks at IBM Watson has made it into the IBM BigInsights product and into EDBT 2014. Here's the full paper.
Secondary Indexes: Local and Global
There are two ways to think about secondary indexes in a distributed system -- "local indexes" and "global indexes". A local index simply indexes the data within each node. If you want to find all the rows which have attribute color set to 'red', each node in the system would be able to find the matching rows using a secondary index on the color column instead of doing a full scan. On the other hand, if you had a global secondary index on color, probing this would return all the keys which have color set to 'red'. This information is likely on just one node -- you don't have to query all the nodes in the system to locate the relevant keys.
Cassandra has has local secondary indexes for a while. HBase has had prototypes for both local and global indexes. See here for some discussion on possible implementations. Local indexes tend to be a little easier to design and implement since you don't have to deal with any notion of a distributed transaction. With a global index, you have to deal with the possibility that there might be a failure between updating the node with the base data and the node with the index data. Also, update latency for an indexed table gets much worse -- you need to update the base table, delete the old index entry, and update the index with the new entry. This translates to two updates and a read. For stores based on Bigtable-like tablets and a log-structured merge tree design, the read latency might even be worse than the write latency. Our paper describes a solution around both these problems for HBase : 1) guaranteeing atomicity in the face of failures, and 2) minimizing the impact to latency.
The core novel idea in the first approach, which we call async-simple is to implement indexing as an asynchronous operation. Update the base table right away, and respond, but then guarantee that within a "short while", the indexes will be consistent. That is, you have read-your-writes consistency for base data, but a more relaxed form of consistency for the index data. For many applications this is a reasonable trade-off that gets you very good latency guarantees. For the applications that need better consistency guarantees, we proposed async-session, which provides session consistency for the index data. That is, you'll be able to see the effects of your own updates within a session. We show that both of these techniques can be implemented while guaranteeing atomic index updates with nearly no impact on update latency. You don't need distributed transactions to get this right. Here's how we solved this -- we added two new components to the RegionServer an Asynchronous Update Queue (AUQ) and an Asynchronous Processing Service (APS). The AUQ is an in-memory data structure that temporarily stores all base updates that require index processing. These live in the same RegionServer as the one responsible for the base row. The APS is a background service that de-queues from the AUQ, inserts new values into the index and deletes the corresponding old values if present.
Recovery is guaranteed by the fact that 1) the WAL entry for the base update has enough information to re-construct the index updates and re-populate the AUQ during recovery, 2) we use the same timestamp for the index updates as for the base data, and 3) we ensure that the AUQ can always be reconstructed using the base WAL entries without resorting to a separate log for the AUQ. Property 3) requires careful coordination between the AUQ and rolling forward the WAL to ensure that it is not garbage collected (when memtables get flushed to disk) until the APS has drained all the corresponding entries from the AUQ. Note that during recovery, we may redo some of the index updates that were done prior to failure. However, this is okay because re-applying an update with an identical timestamp is an idempotent operation. (See Pat Helland's paper here for a nice discussion on how to use idempotent messaging to avoid distributed transactions.)
This means there will be moments in time when the index is not consistent with the base data, and that may be visible to clients. This is unavoidable without an expensive distributed locking or a Generalized Snapshot Isolation-style solution. A small amount of client logic is usually enough to deal with this inconsistency. For applications that do not want to deal with this, we implemented session-consistency where you get to see all the updates made within your session, but this may not hold across sessions. We were able to do this with minor changes to the HBase driver -- you essentially track the necessary state in the client driver. This works fine for small sessions (few KB of in-session updates), which is likely to be the case for most applications that run on HBase style datastores.
The implementation relies mostly on using CoProcessors. In fact, async-simple required nearly no changes to the core, while async-session required changes to the client library. I think a similar implementation can be done in Cassandra as well. The experiments in the paper are pretty basic and straightforward. The most interesting result is that the latency of updates with async-simple and async-session is nearly the same as an update (Put operation) on an un-indexed table.
The main concern is that the throughput caps off lower than the un-indexed updates, but that's understandable that the indexed-update is 2 Puts + a read while the base update is simply 1 update. With some straighforward tuning of the sizes of the AUQ and memtable flush frequency, and assuming relatively cheap reads, we should be able to get indexed updates with the same latency as base updates maxing out at roughly a third the throughput of base updates.