A recent PVLDB paper titled "An Architecture for Compiling UDF-Centric Workflows" describes some pretty exciting advances in the state of the art for data analysis. Using a system called Tupleware, the authors of the paper show that for datasets that fit into the memory of a small cluster, it is possible to extract far more performance from the hardware than is commonly provided on platforms like Spark. For instance, on a K-Means task, Tupleware was faster than Spark by 5X on a single node, and by nearly 100X on a 10-node cluster. For small datasets, that probably means a 1000x (3 orders of magnitude) improvement over Hadoop.
The dramatic performance improvement come from leveraging several known techniques from the literature on main-memory database systems and compilers. Tupleware relies on combining both high-level optimizations common in relational optimizers (projection and selection pushdown, join order optimization) with optimizations considered by recent main-memory systems like pipelining operators vs. vectorized simple operators. Several techniques and heuristics are described in the paper that produce this combined performance improvement.
Tupleware generates and compiles a program for each workflow, and leaning on the LLVM compiler framework lets it use many techniques like inlining UDFs, SIMD vectorization, generated implementations for context variables, etc. In the distributed setting, Tupleware communicates at the level of blocks of bytes while Spark uses fairly expensive Java serialization that consumes a fair fraction of the runtime for distributed execution. Tupleware does not describe any support for fault-tolerance, but with the huge performance improvement over Spark for many applications, simply restarting the job may be reasonable until we get to a certain job size.
Many researchers have pointed out that the fault-tolerance trade-offs that are right for analysis of large datasets (hundreds of terabytes) are not the same as the ones for smaller datasets (few terabytes). Glad to see data management research that's highlighting this well beyond the "X is faster than Hadoop for T".