Monday, May 21, 2012

ZLIB Compression Problem

I recently ran into a weird bug with the FIXED compression strategy in the zlib codec that ships with Hadoop 0.20.2. Here's a program that simply reads a file, compresses it, and then decompresses the result.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.zlib.ZlibCompressor;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressionVerifier {

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("Usage: CompressionVerifier <filename>");
            return;
        }
        String filename = args[0];

        // Read the entire file into memory.
        FileInputStream fis = new FileInputStream(new File(filename));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = fis.read(buffer)) != -1) {
            baos.write(buffer, 0, bytesRead);
        }
        fis.close();
        byte[] data = baos.toByteArray();

        System.out.println("Compressing file: " + filename);
        // Now compress the data with the Hadoop zlib codec.
        JobConf conf = new JobConf();
        DefaultCodec cc = ReflectionUtils.newInstance(DefaultCodec.class, conf);
        cc.setConf(conf);

        Compressor zcom = new ZlibCompressor(
                ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
                ZlibCompressor.CompressionStrategy.FIXED, // Causes error
                //ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY, // Works fine
                ZlibCompressor.CompressionHeader.DEFAULT_HEADER, 64 * 1024);

        baos.reset();
        CompressionOutputStream compressedStream = cc.createOutputStream(baos, zcom);
        compressedStream.write(data);
        compressedStream.close();
        baos.close();
        byte[] compressedData = baos.toByteArray();
        System.out.println("Finished compressing");

        // Decompress the result; reading to EOF verifies the round trip.
        DefaultCodec c2 = ReflectionUtils.newInstance(DefaultCodec.class, conf);
        c2.setConf(conf);
        CompressionInputStream inpStr =
                c2.createInputStream(new ByteArrayInputStream(compressedData));
        System.out.println("Starting decompression");
        while (inpStr.read() != -1) {
            // Discard the decompressed bytes; we only care that inflation succeeds.
        }
        System.out.println("Verified File!");
    }
}

On most inputs the program runs uneventfully. On certain inputs, though, decompression fails with an IOException complaining about an "invalid distance code":

Exception in thread "main" java.io.IOException: invalid distance code
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:62)
    at com.ibm.utils.CompressionVerifier.main(CompressionVerifier.java:65)


Has anyone else run into this? Using CompressionStrategy.DEFAULT_STRATEGY fixes the problem, so I assume the bug is specific to the Z_FIXED strategy (which forces deflate to use only fixed Huffman codes rather than dynamic ones). If you know the zlib codebase and care to help verify or fix the problem, let me know, and I can send you the input file that triggered it.
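
For anyone who just wants the workaround: the only change is the strategy passed to the ZlibCompressor constructor, mirroring the commented-out line in the program above. Everything else stays the same.

// Workaround: use DEFAULT_STRATEGY instead of FIXED when building the compressor.
Compressor zcom = new ZlibCompressor(
        ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
        ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY, // no "invalid distance code"
        ZlibCompressor.CompressionHeader.DEFAULT_HEADER,
        64 * 1024);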
