
First, consider this CustomWriter class:

public final class CustomWriter {

  private final SequenceFile.Writer writer;

  CustomWriter(Configuration configuration, Path outputPath) throws IOException {
    FileSystem fileSystem = FileSystem.get(configuration);
    if (fileSystem.exists(outputPath)) {
      fileSystem.delete(outputPath, true);
    }

    writer = SequenceFile.createWriter(configuration,
        SequenceFile.Writer.file(outputPath),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(ItemWritable.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
        SequenceFile.Writer.blockSize(1024 * 1024),
        SequenceFile.Writer.bufferSize(fileSystem.getConf().getInt("io.file.buffer.size", 4 * 1024)),
        SequenceFile.Writer.replication(fileSystem.getDefaultReplication(outputPath)),
        SequenceFile.Writer.metadata(new SequenceFile.Metadata()));
  }

  public void close() throws IOException {
    writer.close();
  }

  public void write(Item item) throws IOException {
    writer.append(new LongWritable(item.getId()), new ItemWritable(item));
  }
}

What I am trying to do is consume an asynchronous stream of Item objects. The consumer holds a reference to a CustomWriter instance and calls CustomWriter#write for every item it receives. When the stream ends, it calls CustomWriter#close to close the writer.
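
Roughly, the consumer looks like this (a simplified sketch; StreamSubscriber is just a stand-in for my actual streaming API, not a real class):

itemStream.subscribe(new StreamSubscriber<Item>() {
  @Override
  public void onNext(Item item) {
    try {
      customWriter.write(item); // one append per received item
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public void onComplete() {
    try {
      customWriter.close(); // close the writer once the stream ends
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
});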

As you can see, I create only a single writer, and it appends to a brand-new file (any existing file at the output path is deleted first), so leftover data from a previous run or multiple writers cannot be the cause.

I should also note that I am currently running this in a unit-test environment using MiniDFSCluster, as per the instructions here. If I run it in a non-unit-test environment (i.e. without MiniDFSCluster), it seems to work just fine.
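
The test setup is essentially the following (a rough sketch; the base directory and output path are placeholders):

Configuration conf = new Configuration();
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, new File("target/minidfs").getAbsolutePath());
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
try {
  // The writer is constructed against the mini cluster's configuration
  CustomWriter writer = new CustomWriter(cluster.getConfiguration(0), new Path("/out/items.seq"));
  // ... the stream is consumed here: writer.write(item) per item, then writer.close() ...
} finally {
  cluster.shutdown();
}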

When I try to read the file back, all I see is the last written Item object repeated N times (where N is the total number of items received from the stream). Here is an example:

sparkContext.hadoopFile(path, SequenceFileInputFormat.class, LongWritable.class, ItemWritable.class)
    .collect()
    .forEach(new Consumer<Tuple2<LongWritable, ItemWritable>>() {
      @Override
      public void accept(Tuple2<LongWritable, ItemWritable> tuple) {
        LongWritable id = tuple._1();
        ItemWritable item = tuple._2();
        System.out.println(id.get() + " -> " + item.get());
      }
    });

This will print something like this:

...
1234 -> Item[...]
1234 -> Item[...]
1234 -> Item[...]
...

Am I doing something wrong, or is this a side effect of using MiniDFSCluster?


1 Answer

Writable objects (such as LongWritable and ItemWritable) are reused while the data is being processed. When a new record is read, the reader typically just overwrites the contents of the same Writable instance, so you keep receiving references to one and the same object. If you want to collect the records into an array (e.g. via collect()), you should copy each key and value into a new object first.
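
For example, something along these lines (a sketch; WritableUtils.clone is from org.apache.hadoop.io and copies a Writable by serializing and deserializing it, assuming ItemWritable round-trips its full state):

List<Tuple2<LongWritable, ItemWritable>> records = sparkContext
    .hadoopFile(path, SequenceFileInputFormat.class, LongWritable.class, ItemWritable.class)
    .mapToPair(tuple -> new Tuple2<>(
        new LongWritable(tuple._1().get()),                     // fresh key object per record
        WritableUtils.clone(tuple._2(), new Configuration())))  // deep copy of the reused value
    .collect();

Alternatively, map each tuple to plain values (e.g. tuple._1().get() and tuple._2().get()) before calling collect(), as long as ItemWritable#get() returns a fresh object rather than a reused one.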
