
I have been experimenting with generating sequence files for Hadoop outside the Java framework, in Python specifically. There is a python-hadoop module which provides a mostly compatible framework for doing this. I have successfully created sequence files with it; the generated files can be copied to HDFS and used as input for Hadoop jobs. LZO and Snappy are fully configured on my local Hadoop installation, and I can generate properly compressed sequence files with those algorithms when I do so via org.apache.hadoop.io.SequenceFile.createWriter in Java.
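For context, a compatible writer has to reproduce the SequenceFile header byte-for-byte before any record data. The sketch below is my own illustration of that layout (not python-hadoop's actual code), based on my reading of what SequenceFile.Writer emits: the `SEQ` magic, a version byte, VInt-length-prefixed class names, two compression flags, an optional codec class name, a metadata count, and a 16-byte sync marker. The all-zero sync marker is a placeholder; the real writer uses a random one.

```python
import struct

def write_vint_string(s):
    """Serialize a string the way Hadoop's Text.writeString does for class
    names: a variable-length int length followed by UTF-8 bytes. For short
    strings (< 128 bytes) the VInt is a single byte equal to the length."""
    data = s.encode("utf-8")
    assert len(data) < 128  # keep the VInt a single byte for this sketch
    return bytes([len(data)]) + data

def sequence_file_header(key_class, value_class, codec_class=None,
                         sync=b"\x00" * 16):
    header = b"SEQ" + bytes([6])              # magic + format version 6
    header += write_vint_string(key_class)    # key class name
    header += write_vint_string(value_class)  # value class name
    compressed = codec_class is not None
    header += struct.pack(">?", compressed)   # values compressed?
    header += struct.pack(">?", compressed)   # block-compressed?
    if compressed:
        header += write_vint_string(codec_class)
    header += struct.pack(">i", 0)            # metadata: zero entries
    header += sync                            # 16-byte sync marker
    return header

hdr = sequence_file_header(
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.compress.SnappyCodec",
)
print(hdr[:4])  # b'SEQ\x06'
```

Diffing this region of a Java-written file against a python-hadoop-written one is a quick way to rule the header out as the source of the problem.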

However, valid sequence files do not seem to be generated when I use LZO or Snappy as the (block) compression scheme in python-hadoop. I'm using a scheme similar to this code:

https://github.com/fenriswolf/python-hadoop/blob/master/python-hadoop/hadoop/io/compress/LzoCodec.py

(where I replace lzo with snappy for Snappy compression), and within the python-hadoop framework those files can be written and read without any errors. On Hadoop, however, I get EOF errors when I feed them in as input:

Exception in thread "main" java.io.EOFException
        at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:126)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:98)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:64)
        at java.io.DataInputStream.readByte(DataInputStream.java:265)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1911)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1934)
        at SequenceFileReadDemo.main(SequenceFileReadDemo.java:34)

I have consistently seen this particular message only when I use LZO or Snappy.

My suspicion is that Hadoop's LzoCodec and SnappyCodec aren't writing or reading data in the same format as the Python lzo and snappy modules used by python-hadoop, but I'm not sure what the expected format is.
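To make the suspected mismatch concrete: as I understand the stack trace, Hadoop's BlockDecompressorStream.rawReadInt expects each block to be framed as a 4-byte big-endian uncompressed length, then one or more chunks, each a 4-byte big-endian compressed length followed by the compressed bytes. If python-lzo or python-snappy instead emits its own container format (or no length framing at all), the reader runs off the end of the stream, which would produce exactly this EOFException. The sketch below illustrates that framing; zlib stands in for the real LZO/Snappy compressor, since the point is the framing, not the algorithm.

```python
import struct
import zlib

def hadoop_block_frame(data, compress=zlib.compress):
    """Frame one block the way Hadoop's BlockCompressorStream appears to:
    [4-byte BE uncompressed size][4-byte BE chunk size][compressed bytes]."""
    out = struct.pack(">i", len(data))    # uncompressed size of the block
    chunk = compress(data)                # a single chunk, for simplicity
    out += struct.pack(">i", len(chunk))  # compressed size of the chunk
    out += chunk
    return out

def hadoop_block_unframe(buf, decompress=zlib.decompress):
    """Mirror of BlockDecompressorStream's reads: two raw ints, then data.
    A writer that omits these length prefixes would make the reader hit
    end-of-stream mid-read, i.e. an EOFException on the Java side."""
    orig_len, chunk_len = struct.unpack(">ii", buf[:8])
    data = decompress(buf[8:8 + chunk_len])
    assert len(data) == orig_len
    return data

payload = b"hello sequence file" * 10
framed = hadoop_block_frame(payload)
assert hadoop_block_unframe(framed) == payload
```

A hexdump of the compressed-block region of a Java-written LZO/Snappy sequence file against a python-hadoop-written one should show quickly whether these length prefixes are present in both.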

Is there any reason why sequence files with those compression schemes are not generated properly outside the Java Hadoop framework? Again, the whole thing works fine as long as I use Gzip, BZip2, or Default.

Taro Sato
  • Have you tried creating a file in both Java and python with the same data key/values in it and doing a hexdump diff on the files? I have some memory that GZip sequence files don't store part of gzip header (magic number) – Chris White May 10 '13 at 00:38
  • @ChrisWhite that's basically how I've been testing. I have had no issues with GZip but LZO and Snappy have given me troubles. Wonder if the codecs for Hadoop are partially proprietary (which I doubt). – Taro Sato May 10 '13 at 16:37
  • So what did the hexdump diffs reveal? – Chris White May 10 '13 at 17:28
