Tuesday, November 15, 2011

Big Data and Hardware Trends

One of the common refrains from "real" data warehouse folks when it comes to data processing on Hadoop is the assumption that a system built in Java is going to be extremely inefficient for many tasks. That is largely true today. For managing very large datasets, though, I wonder if that's the right way to look at it. CPU cycles available per dollar are increasing more rapidly than sequential disk bandwidth available per dollar. Dave Patterson observed this a long time ago: he noted that CPU performance improves at roughly 1.5x per year while disk bandwidth improves at only about 1.3x per year ... and there seems to be evidence that the disk bandwidth improvements are tapering off.

While SSDs remain expensive for sheer capacity, disks will continue to store most of the petabytes of "big data". A standard 2U server can only hold so many disks: you max out at about 16 2.5" drives or about 8-10 3.5" drives. At roughly 100 MB/s of sequential bandwidth per drive, that gives us about 1500 MB/s of aggregate sequential bandwidth. Meanwhile, you can stuff 48 cores into a 2U server today, and that number will probably be around 200 cores in 3-4 years. Even with improvements in disk bandwidth, I'd be surprised if we could get more than 2000 MB/s out of disks in 3-4 years. In the meantime, the JVMs will likely close the gap in CPU efficiency. In other words, the penalty for being in Java, and therefore CPU-inefficient, is likely to get progressively smaller.
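The back-of-envelope comparison above can be sketched in a few lines; the core counts and bandwidth figures are the rough estimates from this post, not measurements:

```python
# Rough 2011-era estimates from the text: cores and aggregate sequential
# disk bandwidth for a 2U server, today vs. a 3-4 year projection.
cores_today, cores_future = 48, 200
bw_today_mb_s, bw_future_mb_s = 1500, 2000

# Disk bandwidth available per core: the number each core must "earn"
# with its efficiency, so a lower figure means CPU inefficiency hurts less.
ratio_today = bw_today_mb_s / cores_today    # 31.25 MB/s per core
ratio_future = bw_future_mb_s / cores_future # 10.0 MB/s per core

print(ratio_today, ratio_future)
```

If these trends hold, each core has to drive roughly a third as much disk bandwidth in 3-4 years as it does today, which is the sense in which the Java CPU penalty matters less over time.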

There are lots of assumptions in the arguments above, but if they hold up, it means that Hadoop-based data management solutions might start to get competitive with current SQL engines for simple workloads on very large datasets. The Tenzing folks were claiming that this is already the case with a C/C++-based MapReduce engine.


  1. A few thoughts about this topic:
    About disk I/O:
    A typical 2U server can hold 24 2.5" disks or 12 3.5" disks. With ~100 MB/s of bandwidth per SATA drive or ~150 MB/s per SAS drive, the aggregate bandwidth can reach 24 * 150 MB/s = 3600 MB/s.
    Another factor to consider is lightweight compression. With algorithms like Snappy or LZ4, a single thread can compress at around 400 MB/s and decompress at around 1500 MB/s, with compression ratios of 2x-6x. Hadoop-related datasets (web pages, logs, semi-structured data, column-store data) actually compress much better than average; assuming a
    5x ratio, the "virtual" bandwidth can reach 3600 MB/s * 5 = 18000 MB/s.
    With compression enabled, Hadoop clusters are often CPU bound.
    About the JVM:
    The JVM is very fast now and will continue to get faster for normal workloads, but some low-level optimizations are very hard to apply on the JVM: think of the SIMD optimizations in VectorWise, or the LLVM-based execution engine in Tenzing.
    About cluster scale:
    Most analytical workloads are TB scale; only a few large companies really need to scale to PB scale. With many-core processors and very dense disk storage, a commodity server in the near future will have the same capacity as today's small Hadoop cluster, and a very small cluster (from 2U to a rack) will be able to handle 100 TB ~ 2 PB workloads (considering compression). So optimizations for small Hadoop clusters should have higher priority than further scaling out Hadoop clusters.

    1. @Decster -- interesting comments. I do think that small Hadoop clusters and performance optimization at that scale will be useful, interesting, and very different from what you do at the 10,000-node scale. That said, many structured workloads at the ~1 TB scale easily fit on small appliances, often even a single node, today.
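The "virtual bandwidth" arithmetic in the comment thread above can be sketched as follows; the drive count, per-drive rate, and 5x compression ratio are the commenter's assumptions, not measured figures:

```python
# Commenter's assumptions for a dense 2U server with compressed data.
disks = 24                 # 2.5" SAS drives in a 2U chassis
per_disk_mb_s = 150        # rough per-drive sequential rate for SAS
compression_ratio = 5      # assumed for typical Hadoop data (logs, web pages)

raw_bw = disks * per_disk_mb_s             # physical bandwidth off the platters
effective_bw = raw_bw * compression_ratio  # logical bytes delivered per second

print(raw_bw, effective_bw)  # 3600 18000
```

At 18000 MB/s of logical data, decompression and processing rather than the disks become the bottleneck, which is why the comment notes that compressed Hadoop clusters are often CPU bound.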