Appending to SequenceFiles seems to be very slow. We're converting folders of small files into SequenceFiles, using the filename as the key and the file contents as the value. However, throughput is low: about 2 MB/s, or roughly 2 to 3 files per second. We have millions of small files, and at most 3 files per second is far too slow for our purposes.

What we're doing is simply:

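// Requires: java.nio.file.Files, java.nio.file.Paths, java.nio.charset.StandardCharsets
for (String file : files) {
  // Read the whole file into memory; it becomes the record's value.
  byte[] data = Files.readAllBytes(Paths.get(dir.getAbsolutePath(), file));
  // The file name is the record's key.
  byte[] keyBytes = file.getBytes(StandardCharsets.UTF_8);
  BytesWritable key = new BytesWritable(keyBytes);
  BytesWritable val = new BytesWritable(data);

  seqWriter.append(key, val);
}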

Any hints or ideas on how to speed things up?

mroman

1 Answer

Most of the time the culprit is writing with compression enabled (e.g. gzip without native library support). You didn't mention how you set up the seqWriter, so this is just a guess.

Another way to speed things up would be to prefetch the files in batches, or asynchronously and in parallel, since the latency of fetching small files might be the bottleneck rather than the actual append operations.
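
As a rough illustration of the prefetching idea, here is a minimal sketch that reads one batch of files at a time on a thread pool and hands the results to a single consumer in the original order. The class name ParallelPrefetch, the method readAll, and the threads and batchSize parameters are made up for this example. The sink callback stands in for the seqWriter.append call from the question, which stays on one thread because the writer is assumed not to be safe for concurrent use.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Hadoop: prefetches file contents in parallel.
final class ParallelPrefetch {

    /**
     * Reads files in batches of batchSize using a pool of `threads` workers.
     * Each (file name, contents) pair is passed to `sink` on the calling
     * thread, in input order. At most one batch is held in memory at a time.
     */
    static void readAll(List<Path> files, int threads, int batchSize,
                        BiConsumer<String, byte[]> sink)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int start = 0; start < files.size(); start += batchSize) {
                int end = Math.min(start + batchSize, files.size());
                List<Path> batch = files.subList(start, end);

                // Kick off all reads of this batch concurrently.
                List<Future<byte[]>> pending = new ArrayList<>();
                for (Path p : batch) {
                    Callable<byte[]> task = () -> Files.readAllBytes(p);
                    pending.add(pool.submit(task));
                }

                // Consume sequentially; get() blocks until that read is done.
                for (int i = 0; i < batch.size(); i++) {
                    byte[] data = pending.get(i).get();
                    sink.accept(batch.get(i).getFileName().toString(), data);
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In the question's loop, the sink would wrap the name and bytes in BytesWritable objects and call seqWriter.append. Whether this helps depends on where the time actually goes: if the disk itself is saturated, more reader threads will not make it faster, so it is worth measuring with a few different thread counts.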

If append is the bottleneck, you can also increase the buffer size: either configure io.file.buffer.size (default 4 KB) or pass it to the writer builder using the BufferSizeOption option.

Thomas Jungblut
  • The files are locally available on the machine running HDFS. We do use compression, so I'll try disabling it and run some benchmarks without it. – mroman May 02 '16 at 15:03
  • Nah, compression doesn't seem to be the real bottleneck. – mroman May 03 '16 at 07:53
  • @mroman then go and grab a profiler and figure out what is taking so long ;) – Thomas Jungblut May 03 '16 at 08:53
  • My guess is now that it's pretty much file I/O. Our method/code isn't inherently slow, because on a test VM (which has about a quarter of the specs of the production machine) the import is literally 10 times faster. Compression vs. no compression doesn't have any measurable effect on speed. On the test VM the speed is about 20 files of roughly 300 KB each per second, while on the production machines it's only 2 files per second. – mroman May 10 '16 at 08:39