
I am learning Hadoop and this problem has baffled me for a while. Basically I am writing a SequenceFile to disk and then reading it back, but every time I get an EOFException when reading. A deeper look reveals that the sequence file is prematurely truncated while being written: the truncation always happens after writing index 962, and the file always ends up at a fixed size of 45056 bytes.

I am using Java 8 and Hadoop 2.5.1 on a MacBook Pro. In fact, I tried the same code on another Linux machine under Java 7, and the same thing happens.

I can rule out the writer/reader not being closed properly. I tried the old-style try/catch with an explicit writer.close(), as shown in the code below, and also the newer try-with-resources approach. Neither works.
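For reference, the try-with-resources variant I tried looks roughly like this (a sketch only; same writer options and loop as in the full code below):

try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        stream(fs.create(path)),
        keyClass(IntWritable.class),
        valueClass(Text.class))) {
    for (int i = 0; i < 1024; i++) {
        key.set(i);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);
    }
}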

Any help will be highly appreciated.

Following is the code I am using:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

import static org.apache.hadoop.io.SequenceFile.Writer.keyClass;
import static org.apache.hadoop.io.SequenceFile.Writer.stream;
import static org.apache.hadoop.io.SequenceFile.Writer.valueClass;

public class SequenceFileDemo {

private static final String[] DATA = { "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen" };

public static void main(String[] args) throws Exception {
    String uri = "file:///Users/andy/Downloads/puzzling.seq";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path path = new Path(uri);      
    IntWritable key = new IntWritable();
    Text value = new Text();

    // API change: using the Hadoop 2.x Option-based createWriter
    try {
        SequenceFile.Writer writer = SequenceFile.createWriter(conf, 
            stream(fs.create(path)),
            keyClass(IntWritable.class),
            valueClass(Text.class));

        for ( int i = 0; i < 1024; i++ ) {
            key.set( i);
            value.clear();
            value.set(DATA[i % DATA.length]);

            writer.append(key, value);
            if ( (i-1) %100 == 0 ) writer.hflush();
            System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        }

        writer.close();

    } catch (Exception e ) {
        e.printStackTrace();
    }


    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(conf, 
                SequenceFile.Reader.file(path));
        Class<?> keyClass = reader.getKeyClass();
        Class<?> valueClass = reader.getValueClass();

        boolean isWritableSerialization = false;
        try {
            keyClass.asSubclass(WritableComparable.class);
            isWritableSerialization = true;
        } catch (ClassCastException e) {

        }

        if ( isWritableSerialization ) {
            WritableComparable<?> rKey = (WritableComparable<?>) ReflectionUtils.newInstance(keyClass, conf);
            Writable rValue = (Writable) ReflectionUtils.newInstance(valueClass, conf);
            while(reader.next(rKey, rValue)) {
                System.out.printf("[%s] %d %s=%s\n",reader.syncSeen(), reader.getPosition(), rKey, rValue);
            }
        } else {
            // make sure io.serializations contains the serialization that was used when writing the sequence file
        }

        reader.close();
    } catch(IOException e) {
        e.printStackTrace();
    }
}

}
Andy
  • Indeed, I can reproduce this on Windows 8, Java 8 and Hadoop 2.2 as well, even when just writing the integers. Interesting bug you found there. And it actually seems to truncate the file towards the end for some reason. – Thomas Jungblut Jan 13 '15 at 11:23

3 Answers


I think you are missing writer.close() after the write loop. That should guarantee a final flush before you start reading.

yurgis
  • Thanks, but that is not the case. I had added the close() before, thinking exactly the same thing, yet it does not work. – Andy Jan 13 '15 at 10:09

I actually found the error: you are never closing the stream you create in Writer.stream(fs.create(path)).

For some reason the close doesn't propagate down to the stream you just created there. This is a bug I suppose, but I'm too lazy to look it up in Jira for now.

One way to fix your problem is to simply use Writer.file(path) instead.
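A minimal sketch of that variant (assuming the same conf, path and key type as above; values simplified to NullWritable as in the corrected example below):

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            Writer.file(path),
            Writer.keyClass(IntWritable.class),
            Writer.valueClass(NullWritable.class))) {
        // with Writer.file(path) the writer creates and owns the stream,
        // so closing the writer also closes the underlying file
        for (int i = 0; i < 1024; i++) {
            writer.append(new IntWritable(i), NullWritable.get());
        }
    }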

Obviously, you can also just close the created stream explicitly. Find my corrected example below:

    Path path = new Path("file:///tmp/puzzling.seq");

    try (FSDataOutputStream stream = fs.create(path)) {
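        // closing this stream explicitly (via the outer try-with-resources) is what fixes
        // the truncation: writer.close() does not close a stream passed in via Writer.stream(...)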
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf, Writer.stream(stream),
                Writer.keyClass(IntWritable.class), Writer.valueClass(NullWritable.class))) {

            for (int i = 0; i < 1024; i++) {
                writer.append(new IntWritable(i), NullWritable.get());
            }
        }
    }

    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, Reader.file(path))) {
        Class<?> keyClass = reader.getKeyClass();
        Class<?> valueClass = reader.getValueClass();

        WritableComparable<?> rKey = (WritableComparable<?>) ReflectionUtils.newInstance(keyClass, conf);
        Writable rValue = (Writable) ReflectionUtils.newInstance(valueClass, conf);
        while (reader.next(rKey, rValue)) {
            System.out.printf("%s = %s\n", rKey, rValue);
        }

    }
Thomas Jungblut
  • Thanks Thomas! I verified the fix and it works. In addition, your answer prompted me to look at the source code. When creating the writer, if we pass in the option **Writer.file(path)**, the writer "owns" the underlying stream created internally and will close it when close() is called. Yet if we pass in **Writer.stream(aStream)**, the writer assumes someone else is responsible for that stream and won't close it when close() is called. – Andy Jan 14 '15 at 02:16
0

Thanks to Thomas.

It boils down to whether the created writer "owns" the stream or not. When creating the writer, if we pass in the option Writer.file(path), the writer "owns" the underlying stream created internally and will close it when close() is called. Yet if we pass in Writer.stream(aStream), the writer assumes someone else is responsible for that stream and won't close it when close() is called. In short, it is not a bug, just that I did not understand it well enough.
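In code, the contrast looks roughly like this (a sketch, assuming conf, fs and path as in the question):

    // Writer.file(path): the writer opens the stream itself, owns it,
    // and closes it when writer.close() is called.
    SequenceFile.Writer owning = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class));

    // Writer.stream(out): the writer only borrows the stream; writer.close()
    // does not close it, so the caller must close out explicitly.
    FSDataOutputStream out = fs.create(path);
    SequenceFile.Writer borrowing = SequenceFile.createWriter(conf,
            SequenceFile.Writer.stream(out),
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class));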

Andy