0

I am compressing files of over 2GB in Java by applying two compression algorithms consecutively: one LZ-based and one Huffman-based (similar to DEFLATE).

Since 2GB is too large to hold in any buffer, I have to pass the file through the first algorithm, writing out a temporary file, and then pass that temporary file through the second algorithm to produce the final file.

An alternative is to compress the file in 8MB blocks (the largest size at which I don't get an Out-Of-Memory error), but then I cannot take full advantage of the redundancy within the entire file.

Any ideas how to perform these operations more neatly, with no temporary files and no compressing in blocks? Do any other compression tools compress in blocks? How do they deal with this issue? Regards

Danny Rancher
  • If you're running a 64-bit JVM you should be able to allocate enough heap space to use MUCH larger blocks (e.g. 1GB instead of 8MB). Look at the `-Xms` and `-Xmx` JVM options. – Jim Garrison Feb 06 '14 at 17:35
  • 1
    Do your algorithm implementations not produce any output until they have completely read the input? If that is the case you're out of luck and will need to use temporary storage. However, I seriously doubt that is the case, each algorithm starts producing output after having read some portion of the input. In that case you can use pipes to feed the output stream of the first algorithm to the second, and write the output from the second to disk. – Jim Garrison Feb 06 '14 at 17:49
  • 1
    I think you overestimate the “ability to take full advantage of the redundancy within the entire file”. Use smaller blocks. Though it is strange that you can’t use block bigger than 8MB. You seem to have a very small heap. – Holger Feb 06 '14 at 18:26

3 Answers

1

Java comes with the `java.util.zip` library to perform data compression in ZIP format. The overall concept is quite straightforward.

The library reads the file with a `FileInputStream`, adds the file name to a `ZipEntry`, and writes the data to a `ZipOutputStream`.

`import java.util.zip.ZipEntry` and `import java.util.zip.ZipOutputStream` are the imports needed to write a ZIP archive from a program.

But how can you decompress a file?
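
For what it's worth, a minimal sketch of that flow (the file names are just placeholders); decompression is the mirror image, using `ZipInputStream`:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipExample {

    // Compress a single file into a .zip archive, streaming in small chunks.
    static void zip(String inputFile, String zipFile) throws IOException {
        try (FileInputStream in = new FileInputStream(inputFile);
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipFile))) {
            out.putNextEntry(new ZipEntry(inputFile)); // entry name = file name
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) > 0) {
                out.write(buffer, 0, len);             // never holds the whole file in memory
            }
            out.closeEntry();
        }
    }

    // Decompress: ZipInputStream reads the entries back out.
    static void unzip(String zipFile, String outputFile) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new FileInputStream(zipFile));
             FileOutputStream out = new FileOutputStream(outputFile)) {
            in.getNextEntry();                         // position at the first (and only) entry
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) > 0) {
                out.write(buffer, 0, len);
            }
            in.closeEntry();
        }
    }
}
```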

Jeevan Roy dsouza
rahul
  • This java.util.zip approach compresses each file separately and then concatenates them, whereas I wish to make use of the solid compression paradigm http://en.wikipedia.org/wiki/Solid_compression. java.util.zip also fails on large files (2GB+). – Danny Rancher Feb 07 '14 at 16:10
0

What's wrong with piping the streams? You can read from an InputStream, compress the bytes, and write them to an output stream that is connected to the input stream of the next algorithm. Take a look at PipedInputStream and PipedOutputStream.

I hope that these algorithms can work incrementally.
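
For illustration, a minimal sketch of that piping, assuming both stages can work on a stream of chunks; `firstStage` and `secondStage` below are placeholders standing in for the two real algorithms (here they just copy bytes through), and the file names are made up:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipedStages {

    public static void main(String[] args) throws Exception {
        PipedOutputStream stage1Out = new PipedOutputStream();
        PipedInputStream stage2In = new PipedInputStream(stage1Out, 64 * 1024); // 64 KB pipe buffer

        // Stage 1 runs in its own thread: read the source file and write its output into the pipe.
        Thread stage1 = new Thread(() -> {
            try (InputStream in = new FileInputStream("input.bin");
                 OutputStream out = stage1Out) {
                firstStage(in, out);   // placeholder for the LZ pass
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        stage1.start();

        // Stage 2 runs in the current thread: read from the pipe and write the final file.
        try (InputStream in = stage2In;
             OutputStream out = new FileOutputStream("output.bin")) {
            secondStage(in, out);      // placeholder for the Huffman pass
        }
        stage1.join();
    }

    // Placeholder stages: copy bytes through in small chunks, never buffering the whole input.
    static void firstStage(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int len;
        while ((len = in.read(buf)) > 0) out.write(buf, 0, len);
    }

    static void secondStage(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int len;
        while ((len = in.read(buf)) > 0) out.write(buf, 0, len);
    }
}
```

Closing the pipe's output stream at the end of stage 1 is what signals end-of-input to stage 2.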

AlexR
  • Hi, thank you for your answer. I do not understand your use of the word incrementally. My first algorithm must complete before the second one can be applied. Regards. – Danny Rancher Feb 06 '14 at 17:41
  • I mean that I hope your algorithm can read a limited chunk of bytes, compress them, write them to the output stream, move on to processing the next chunk, and not have to hold the whole input in memory to process it from beginning to end. – AlexR Feb 06 '14 at 19:11
  • "My first algorithm must complete before the second one can be applied." seems quite odd. Does your second algorithm work on the output of the first algorithm backwards? – Mark Adler Feb 07 '14 at 04:52
0

You could use two levels of java.util.zip. First, just concatenate all files (without compression). If possible, sort the entries by file type so that similar files are next to each other (this will increase compression ratio). Second, compress this stream. You don't need to run two separate phases; instead, you can wrap the first within the second stage, like CompressStream(ConcatenateFiles(directory)). That way you have a zip file within another zip file: the outer zip file is compressed, the inner is not and contains all the actual files.
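
A rough sketch of that nesting, assuming the files to pack sit in a local `data` directory (a placeholder): the inner `ZipOutputStream` runs at `NO_COMPRESSION`, so it effectively only concatenates the entries, while the outer one compresses the single `inner.zip` entry.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class NestedZip {

    public static void main(String[] args) throws IOException {
        File dir = new File("data");                        // directory with the files to pack
        try (ZipOutputStream outer = new ZipOutputStream(new FileOutputStream("archive.zip"))) {
            outer.setLevel(Deflater.BEST_COMPRESSION);      // the outer layer does the real compression
            outer.putNextEntry(new ZipEntry("inner.zip"));

            // Inner zip at level 0: the files are only concatenated, not compressed.
            ZipOutputStream inner = new ZipOutputStream(outer);
            inner.setLevel(Deflater.NO_COMPRESSION);
            byte[] buffer = new byte[8192];
            for (File f : dir.listFiles()) {                // sort by file type here to group similar files
                inner.putNextEntry(new ZipEntry(f.getName()));
                try (FileInputStream in = new FileInputStream(f)) {
                    int len;
                    while ((len = in.read(buffer)) > 0) inner.write(buffer, 0, len);
                }
                inner.closeEntry();
            }
            inner.finish();                                 // finish the inner zip without closing the outer stream
            outer.closeEntry();
        }
    }
}
```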

It is true that java.util.zip used to have problems with files larger than 2 GB (I did run into those problems). However, I believe that was only the case for ZipFile and not for ZipIn/OutputStream. Also, I think those problems are fixed with recent Java versions.

Buffer size: regular compression algorithms such as Deflate will not benefit from chunk sizes larger than about 64 KB. More advanced algorithms can benefit from using larger chunk sizes, for example bzip2 up to 900 KB, or LZMA2 up to 2 MB. Anything beyond that is more likely the domain of data deduplication, which might or might not make sense for what you want to do.

Thomas Mueller