I have tested several Base64 encoders (mig64, iHarder, sun, etc.). They all seem to require the whole input to be in memory for the conversion.

If I want to encode a large file (stream) of more than 1 GB in a multi-threaded fashion, which codec implementation can be used without corrupting the file? Commons Codec seems to have the Base64OutputStream wrapper. Are there any other solutions?

To make it clear: I have a 1 TB file that needs to be Base64-encoded. The machine has 2 GB of RAM. What is the fastest way to do it in Java?

zudokod

1 Answer

I'm not sure offhand which encoder is faster; you'll have to measure each to determine that. However, you can avoid the memory problem and get concurrency by splitting the file into chunks. Just make sure you split them on a 6-byte boundary, since 6 input bytes turn into exactly 8 Base64 characters. (Any multiple of 3 bytes works, because 3 bytes map to 4 characters.)

I'd recommend picking a reasonable chunk size and using an ExecutorService to manage a fixed number of threads to do the processing. You can share a RandomAccessFile between them and write to the appropriate places. You'll of course have to calculate the output chunk offsets (just multiply the input offset by 8 and divide by 6).
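
As a rough illustration of that approach, here is a minimal sketch. It makes a few assumptions that are not in the answer above: it uses the JDK's `java.util.Base64` (Java 8+) as the encoder rather than any of the libraries named in the question, it shares two `FileChannel`s with positional reads and writes instead of a `RandomAccessFile`, and the class name `ParallelBase64` and its method names are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: encodes fixed-size chunks on a thread pool and writes
// each result at its precomputed offset in the output file.
public class ParallelBase64 {

    /** Output offset for a chunk that starts at inputOffset (a multiple of 3). */
    public static long outputOffset(long inputOffset) {
        return inputOffset / 3 * 4; // same ratio as "multiply by 8, divide by 6"
    }

    public static void encode(Path in, Path out, int chunkSize, int threads) throws Exception {
        if (chunkSize <= 0 || chunkSize % 3 != 0) {
            throw new IllegalArgumentException("chunkSize must be a positive multiple of 3");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = src.size();
            List<Future<?>> pending = new ArrayList<>();
            for (long pos = 0; pos < size; pos += chunkSize) {
                final long start = pos;
                final int len = (int) Math.min(chunkSize, size - pos);
                pending.add(pool.submit(() -> {
                    encodeChunk(src, dst, start, len);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
        } finally {
            pool.shutdown();
        }
    }

    private static void encodeChunk(FileChannel src, FileChannel dst, long start, int len)
            throws IOException {
        ByteBuffer inBuf = ByteBuffer.allocate(len);
        while (inBuf.hasRemaining()) {
            // Positional read: does not move the channel's shared position.
            if (src.read(inBuf, start + inBuf.position()) < 0) {
                throw new IOException("unexpected end of file");
            }
        }
        ByteBuffer outBuf = ByteBuffer.wrap(Base64.getEncoder().encode(inBuf.array()));
        long outPos = outputOffset(start);
        while (outBuf.hasRemaining()) {
            outPos += dst.write(outBuf, outPos); // positional write
        }
    }
}
```

Only the final chunk can end in `=` padding, so the offsets of all earlier chunks stay exact. Memory use is roughly `threads * chunkSize * 7 / 3` for the buffers, which is what keeps this workable on a 2 GB machine as long as the chunk size is modest.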

Honestly though you might not realize much performance gain here with concurrency. It could just overwhelm the hard drive with random access. I'd start with chunking the file up using a single thread. See how fast that is first. You can probably crunch a 1GB file faster than you think. As a rough guess I'd say 1 minute on modern hardware, even writing to the same drive you're reading from.
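
For the single-threaded baseline, a streaming wrapper keeps memory use constant regardless of file size. The sketch below assumes the JDK's `java.util.Base64` encoder (Java 8+), which plays the same role as the Commons Codec `Base64OutputStream` mentioned in the question; the class name `StreamingBase64` and the buffer size are arbitrary choices for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper: copies a file through an encoding stream, one buffer at a time.
public class StreamingBase64 {

    public static void encode(Path in, Path out) throws IOException {
        try (InputStream src = new BufferedInputStream(Files.newInputStream(in));
             OutputStream dst = Base64.getEncoder().wrap(
                     new BufferedOutputStream(Files.newOutputStream(out)))) {
            byte[] buf = new byte[3 * 64 * 1024]; // only this buffer is held in memory
            int n;
            while ((n = src.read(buf)) != -1) {
                dst.write(buf, 0, n);
            }
        } // closing the wrapper writes any final padding and closes the file
    }
}
```

Timing this on the real hardware gives a number to compare any multi-threaded version against.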

WhiteFang34
  • How do I ensure the integrity, say with line breaks after 76 characters, etc.? – zudokod Apr 14 '11 at 18:51
  • I wouldn't split it on line breaks, you'll need to split on a fixed byte boundary. If you read line by line then you can't guarantee that each line is a multiple of 6 bytes. – WhiteFang34 Apr 14 '11 at 18:54
  • I meant writing... By spec, the output should have a line break after every 76 characters for larger chunks, i.e. the file is converted to another file of characters that has line breaks after every 76 characters, according to the specification. – zudokod Apr 14 '11 at 18:58
  • Ah, I see. You need a chunk size that produces full 76-character lines. Then you can calculate the destination offset. For example, 3648 input bytes will produce 4864 output characters in Base64. That's 64 lines of output. Assuming that you have 2 bytes for a CRLF at the end of each line, that adds another 128 bytes of output. So for each 3648-byte input chunk you'll get a 4992-byte output chunk. Just write to the correct offset in the file for the chunk you're processing. – WhiteFang34 Apr 14 '11 at 19:06
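
The arithmetic in the last comment generalizes: each 76-character line holds 57 input bytes and takes 78 output bytes with its CRLF. The sketch below shows one way to compute the offsets and encode a chunk. It assumes the JDK's `Base64.getMimeEncoder()` (Java 8+), which wraps at 76 characters with CRLF but does not emit a line break after the final line, so the break is appended by hand for every chunk except the last. The class name `MimeChunks` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for line-wrapped (76 characters + CRLF) chunked output.
public class MimeChunks {
    static final int BYTES_PER_LINE = 57; // 57 input bytes -> 76 Base64 characters
    static final int LINE_LENGTH = 78;    // 76 characters plus CRLF

    /** Output offset for a chunk that starts at inputOffset (a multiple of 57). */
    public static long outputOffset(long inputOffset) {
        if (inputOffset % BYTES_PER_LINE != 0) {
            throw new IllegalArgumentException("offset must be a multiple of 57");
        }
        return inputOffset / BYTES_PER_LINE * LINE_LENGTH;
    }

    /** Encodes one chunk; every chunk except the last must be a multiple of 57 bytes. */
    public static byte[] encodeChunk(byte[] chunk, boolean isLast) {
        byte[] body = Base64.getMimeEncoder().encode(chunk);
        if (isLast) {
            return body; // no trailing line break after the final line
        }
        // The encoder omits the line break after its last line, so add it here
        // to keep the next chunk starting on a fresh line.
        byte[] withBreak = Arrays.copyOf(body, body.length + 2);
        withBreak[body.length] = '\r';
        withBreak[body.length + 1] = '\n';
        return withBreak;
    }
}
```

With a 3648-byte chunk size, `outputOffset(3648)` gives 4992, matching the figure worked out in the comment.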