Read Hadoop SequenceFile: weird hex number stream

Question

I am trying to convert a piece of Hadoop SequenceFile into plain text with the following code:

    Configuration config = new Configuration();
    Path path = new Path( inputPath );
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
    WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();

    File output = new File(outputPath);
    if(!output.exists()) output.createNewFile();

    FileOutputStream fos = new FileOutputStream(output);
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos, "utf-8"));

    int count = 0;

    try {
        while(reader.next(key,value) && count < 1000)
        {
            bw.write("Key::: " + key);
            bw.newLine();
            bw.write("Value::: " + value);
            bw.newLine();
            bw.newLine();
            count++;
        }
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    reader.close();
    bw.close();

The keys are converted properly. However, the values come out as a weird stream of hex numbers. A sample:

Value::: 1f 8b 08 00 00 00 00 00 00 03 e5 bd f9 7b 13 47 d6 28 fc 73 e6 79 e6 7f e8 28 17 6c 5f bc 68 5f 6c e4 5c 96 64 26 33 c9 24 37 cb bc ef 3b 0c 9f 9f 56 77 cb ee 58 96 34 5a 20 8e e3 3f 46 56 c2 10 30 c4 8b e4 4d 5e b1 6c 4b f2 22 59 b2 65 63 48 08 04 42 12 c2 9e 00 21 cb f3 9d 53 d5 2d b5 64 4b 16 33

The real stream is much longer than this. What I know is that the keys are stored as Hadoop Text and the values as Hadoop BytesWritable. The values might be Chinese text, but I am not sure about this.

Does anybody know what is going on?

For what it's worth, have you actually tried `hadoop fs -text `? If the `Writable`s have `toString()` implemented, it should print sensible output... — TC1, Mar 14 '13 at 08:48
Sorry, somehow didn't see the note that values are `BytesWritable`. It makes sense that they are output the way they are -- `BytesWritable` is just what it says it is -- an array of bytes, think `byte[]`. If you want to output them as text (and you say they "might" be in Chinese) you'll need to know the encoding and convert that byte array to a `String` before printing it. There's a constructor overload for `String(byte[], Charset)`, but as I said -- you'll need to know (or guess) the encoding. — TC1, Mar 14 '13 at 09:07

score 1 · Accepted Answer · answered Mar 14 '13 at 09:14

You say the values are stored as BytesWritable. That maps to byte[] in Java, a byte array -- and that is exactly what's being printed, since BytesWritable overrides toString() to print the bytes as hex.

You also mention that the bytes might be text in Chinese. If you want to output that, you'll need to decode the bytes into a String. Change the line

    bw.write("Value::: " + value);

to these two:

    BytesWritable bytes = (BytesWritable) value;
    bw.write("Value::: " + new String(bytes.getBytes(), 0, bytes.getLength(), Charset.forName("UTF-8")));

Only the first getLength() bytes are decoded because getBytes() returns the backing array, which can be longer than the valid data. This assumes the Chinese string is encoded as UTF-8, which might not be the case. If you don't know the exact encoding, you'll have to try different ones and see what works.
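
If you end up trying several encodings, a strict decoder can at least rule out the ones that cannot be right. Below is a minimal sketch using only the JDK. The class name CharsetProbe and the candidate list are placeholders of mine, not anything from Hadoop, and some JVMs may not ship every charset listed (hence the isSupported check). A clean decode does not prove the guess is correct, so the results still need to be checked by eye.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.LinkedHashMap;
import java.util.Map;

public class CharsetProbe {
    // Candidate names are an assumption; replace them with the encodings you suspect.
    static final String[] CANDIDATES = {"UTF-8", "GB18030", "GBK", "Big5"};

    /** Strictly decodes the first `length` bytes; returns null if they are invalid in `charset`. */
    static String decodeStrict(byte[] data, int length, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data, 0, length))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // the bytes are not valid in this charset
        }
    }

    /** Maps charset name to decoded text for every candidate that decodes without errors. */
    static Map<String, String> probe(byte[] data, int length) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                continue; // skip charsets this JVM does not provide
            }
            String text = decodeStrict(data, length, Charset.forName(name));
            if (text != null) {
                results.put(name, text);
            }
        }
        return results;
    }
}
```

Inside the read loop this would be called as CharsetProbe.probe(bytes.getBytes(), bytes.getLength()), where bytes is the value cast to BytesWritable.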

TC1

Yes. It works now. But the output is still garbage. I guess the text was zipped in some way before being written to BytesWritable. Thanks. – Yuhao Mar 14 '13 at 09:34
As I said, the encoding might be something more locally specific, I know my country uses `Win-1257` on occasion. Chinese might have something equally lame. There are frameworks that can try guessing it, but I can't help you with that. – TC1 Mar 14 '13 at 09:54
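
On the "zipped" guess: the sample in the question starts with 1f 8b 08, and 1f 8b is the gzip magic number with 08 being the deflate method byte, so the values do look gzip-compressed rather than oddly encoded. Here is a rough sketch of checking for that and inflating before decoding. It uses only the JDK; the class name GzipValueDecoder and its method names are made up for illustration, and the charset of the decompressed bytes is still a guess.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

public class GzipValueDecoder {
    /** True if the data starts with the gzip magic number 0x1f 0x8b. */
    static boolean looksGzipped(byte[] data, int length) {
        return length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /** Inflates the first `length` bytes of a gzip stream into a new byte array. */
    static byte[] gunzip(byte[] data, int length) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data, 0, length));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Decompresses if the data looks gzipped, then decodes with the given charset. */
    static String toText(byte[] data, int length, Charset charset) throws IOException {
        if (looksGzipped(data, length)) {
            return new String(gunzip(data, length), charset);
        }
        // Not gzip: decode only the valid portion of the array.
        return new String(data, 0, length, charset);
    }
}
```

In the loop from the question, the value line would become something like bw.write("Value::: " + GzipValueDecoder.toText(bytes.getBytes(), bytes.getLength(), StandardCharsets.UTF_8)), with bytes being the value cast to BytesWritable. Passing getLength() matters because getBytes() can return a backing array with unused trailing bytes.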
