
We are using Apache Camel for compressing and decompressing our files. We use the standard .marshal().gzip() and .unmarshal().gzip() APIs.

Our problem is that when we get really large files, say 800 MB to more than 1 GB in size, our application runs out of memory, since the entire file is loaded into memory for compression and decompression.

Are there any Camel APIs or Java libraries which will help zip/unzip the file without loading the entire file into memory?

There is a similar unanswered question here

  • The Apache Camel `ZipFileDataFormat.unmarshal()` implementation only supports creating the ZIP archive in memory. If you want to change that, you have to implement your own DataFormat that handles this, e.g. as a stream. – Robert May 11 '18 at 15:39
  • @Robert ok thanks. – phoenixSid May 15 '18 at 12:17

1 Answer


Explanation

Use a different approach: Stream the file.

That is, don't load it into memory completely, but read it piece by piece and simultaneously write it back out.

Get an InputStream to the file and wrap a GZip input stream around it. Then read piece by piece and write to an OutputStream.

It is the opposite if you want to compress an archive: then you wrap the OutputStream in a GZip output stream.


Code

The examples use Apache Commons Compress, but the logic remains the same for all libraries.

Unpacking a gz archive:

Path inputPath = Paths.get("archive.tar.gz");
Path outputPath = Paths.get("archive.tar");

try (InputStream fin = Files.newInputStream(inputPath);
        OutputStream out = Files.newOutputStream(outputPath);
        GzipCompressorInputStream in = new GzipCompressorInputStream(
            new BufferedInputStream(fin))) {

    // Read and write chunk by chunk
    final byte[] buffer = new byte[8 * 1024];
    int n;
    while (-1 != (n = in.read(buffer))) {
        out.write(buffer, 0, n);
    }
}

Packing as gz archive:

Path inputPath = Paths.get("archive.tar");
Path outputPath = Paths.get("archive.tar.gz");

try (InputStream in = Files.newInputStream(inputPath);
        OutputStream fout = Files.newOutputStream(outputPath);
        GzipCompressorOutputStream out = new GzipCompressorOutputStream(
            new BufferedOutputStream(fout))) {

    // Read and write chunk by chunk
    final byte[] buffer = new byte[8 * 1024];
    int n;
    while (-1 != (n = in.read(buffer))) {
        out.write(buffer, 0, n);
    }
}
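The same streaming approach also works with nothing but the JDK's java.util.zip, if you want to avoid the Commons Compress dependency. A minimal sketch (file names are placeholders; `InputStream.transferTo` requires Java 9+, which copies in small chunks rather than loading the whole file):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStreams {
    // Compress: wrap the output side with GZIPOutputStream
    static void gzip(Path in, Path gz) throws IOException {
        try (InputStream is = Files.newInputStream(in);
                OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            is.transferTo(out); // streams chunk by chunk
        }
    }

    // Decompress: wrap the input side with GZIPInputStream
    static void gunzip(Path gz, Path out) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gz));
                OutputStream os = Files.newOutputStream(out)) {
            in.transferTo(os);
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a small demo file
        Path plain = Paths.get("data.txt");
        Files.write(plain, "hello gzip".getBytes());
        gzip(plain, Paths.get("data.txt.gz"));
        gunzip(Paths.get("data.txt.gz"), Paths.get("data-roundtrip.txt"));
        System.out.println(new String(
            Files.readAllBytes(Paths.get("data-roundtrip.txt"))));
    }
}
```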

You could also wrap a BufferedReader and PrintWriter around the streams if you feel more comfortable with them. They manage the buffering themselves, and you can read and write lines instead of bytes. Note that this only works correctly if the file is line-based text, not some other binary format.
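For that line-based variant, a sketch using only the JDK (file names are placeholders; this assumes UTF-8 text, and the demo writes a small gzipped file first so it is self-contained):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLines {
    public static void main(String[] args) throws IOException {
        // Prepare a small gzipped text file to demonstrate with
        Path gz = Paths.get("lines.txt.gz");
        try (PrintWriter w = new PrintWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(gz)),
                StandardCharsets.UTF_8))) {
            w.println("first line");
            w.println("second line");
        }

        // Stream it back out line by line; only one line is in memory at a time
        Path out = Paths.get("lines.txt");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(Files.newInputStream(gz)),
                        StandardCharsets.UTF_8));
                PrintWriter writer = new PrintWriter(Files.newBufferedWriter(out))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.println(line);
            }
        }
        System.out.println(Files.readAllLines(out));
    }
}
```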

  • Thanks for the answer. Actually I know streaming the file is one way. I am searching for some APIs instead of manually streaming the file – phoenixSid May 15 '18 at 06:03
  • An API which does what? This is already very small code. Pack it into a `zip` and `unzip` method and you have your small utility methods. – Zabuzard May 15 '18 at 11:14
  • 1
    Its a small code yes. But I am not zipping/unzipping from a stream. It's a camel route which does a lot many things and then receives the message via a message queue and then compresses the message. – phoenixSid May 15 '18 at 12:19
  • 1
    Since i couldn't find any other direct api-like solution, I am using this approach only. Thanks again. Accepting your answer. – phoenixSid May 23 '18 at 15:08
  • 1
    @phoenixSid it would be nice to do this with camel i create a ticket about https://issues.apache.org/jira/browse/CAMEL-13774 – Michael Jul 21 '19 at 19:57