3

I'm writing a (relatively) small application in Haskell for academic purposes. I'm implementing Huffman compression, based on this code: http://www.haskell.org/haskellwiki/Toy_compression_implementations .

My variant of this code is here: https://github.com/kravitz/har/blob/a5d221f227c27fd1c5587217a29a169a377521a6/huffman.hs . It uses lazy bytestrings. When I implemented RLE compression everything went smoothly, because it processes the input stream in one pass. But Huffman processes it twice, and as a result I end up with a fully evaluated bytestring stored in memory, which is bad for big files (and even for relatively small files it allocates too much heap space). That is not only my suspicion: profiling also shows that most of the heap is eaten by bytestring allocations.
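
In simplified form, the pattern looks roughly like this (a sketch only, not the actual code from the repository; countFrequencies and encodeWith stand in for my real functions):

import qualified Data.ByteString.Lazy as L

compressFile :: FilePath -> FilePath -> IO ()
compressFile inPath outPath = do
  input <- L.readFile inPath
  -- First pass: walk the whole input to build the frequency table.
  let codebook = countFrequencies input
  -- Second pass: encode the same bytestring again. Because input is still
  -- referenced here, every chunk produced during the first pass stays alive,
  -- so the whole file ends up resident in memory.
  L.writeFile outPath (encodeWith codebook input)
  where
    countFrequencies = undefined -- placeholder for the real frequency pass
    encodeWith       = undefined -- placeholder for the real encoder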

Also, I serialize the stream length into the file, which may also cause the full bytestring to be loaded into memory. Is there any simple way to kindly ask GHC to re-evaluate the stream several times?

kravitz
  • [Adaptive Huffman Coding](http://en.wikipedia.org/wiki/Adaptive_Huffman_coding) only needs a single pass. – ephemient Feb 16 '11 at 15:00
  • Nice, but in fact my task is to implement Huffman compression in a two-pass way. I chose Haskell for it because I'm trying to learn the language, and as usual knowledge comes with practice. – kravitz Feb 16 '11 at 15:50

2 Answers

4

Instead of passing a bytestring to the encoder, you can pass something that computes a bytestring, then explicitly recompute the value each time you need it.

-- Assumed imports; makeCodebook and encode are the corresponding functions
-- from your huffman.hs. (On older GHCs unsafeIOToST is in Control.Monad.ST.)
import Control.Monad.ST (ST, stToIO)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString

compress :: ST s ByteString -> ST s ByteString
compress makeInput = do
  len      <- (return $!) . ByteString.length =<< makeInput
  codebook <- (return $!) . makeCodebook      =<< makeInput
  return . encode len codebook                =<< makeInput

compressIO :: IO ByteString -> IO ByteString
compressIO m = stToIO (compress (unsafeIOToST m))

The parameter to compress should actually compute the value; simply wrapping an already-computed value with return won't work. Also, each call to makeInput must have its result evaluated, otherwise a lazy, unevaluated copy of the input will remain in memory when the input is recomputed.
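
For example, a hypothetical usage sketch (assuming the definitions above; the file names are made up):

main :: IO ()
main = do
  -- The readFile action itself is passed in, so compress re-opens and
  -- re-reads the file on every pass instead of holding onto the first result.
  compressed <- compressIO (ByteString.readFile "input.dat")
  ByteString.writeFile "input.har" compressed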

The usual approach, as barsoap said, is to just compress one block at a time.

Heatsink
  • I like your idea, but I can't implement it due to type mismatch complaints: it expected `IO ByteString`, but the inferred type is `ST s ByteString`. It happens on the lines with the (=<<) operator. Excuse me, I'm such a noob in Haskell :) – kravitz Feb 16 '11 at 15:39
  • @kravitz Ah, I didn't notice that 'evaluate' only works in IO. I changed 'evaluate' to (return $!), which can run in ST. – Heatsink Feb 16 '11 at 17:02
  • @Heatsink Is it enough if I pass a `Data.ByteString.Lazy.readFile name` to compressIO? Because memory consumption is still huge (1.5 GB for a 10 MB file) – kravitz Feb 16 '11 at 23:24
  • @kravitz That should be sufficient. If it is still using too much space, the best approach is to profile (http://haskell.org/ghc/docs/latest/html/users_guide/prof-heap.html) with -hd to learn what data is occupying memory, and -hr to learn what function is retaining the data. – Heatsink Feb 17 '11 at 02:14
  • @Heatsink Most of the data is Chunks (which lazy ByteStrings are made of), and the function that retains most of the data is compress (written in the same manner you advised) – kravitz Feb 17 '11 at 09:27
  • Yay, I found that serializing a lazy bytestring with a lot of lazy evaluation around it is a seriously bad idea (because it calls length on it) – kravitz Feb 17 '11 at 11:13
  • Calling hPut on a lazy bytestring should be lazy (it uses a lazy fold function internally). Is that the function you're using? – Heatsink Feb 17 '11 at 17:27
  • No, I use Binary.encode for serialization into a ByteString stream and Data.ByteString.Lazy.writeFile for the actual writing – kravitz Feb 18 '11 at 09:11
  • I see. To avoid the space leak, you will have to use an output method that consumes the data lazily. – Heatsink Feb 18 '11 at 19:54
  • I already implemented that: I serialize the tree and the length of the initial stream, then append the raw lazy bytestring that comes out of the Huffman compression algorithm. To deserialize I use Binary.get on the tree and length, and then getRemainingLazyByteString to get my compressed bytestring. All operations here involve lazy bytestrings only, so I achieved low (even constant) memory consumption for input files of any length (see the sketch after these comments). – kravitz Feb 19 '11 at 01:12
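
A rough sketch of the layout described above (hypothetical names; the tree type is assumed to have a Binary instance, and binary's exact strictness behaviour may vary across versions):

import qualified Data.ByteString.Lazy as L
import Data.Binary (Binary, get, put)
import Data.Binary.Get (runGet, getRemainingLazyByteString)
import Data.Binary.Put (runPut)
import Data.Word (Word64)

-- The header (tree and original length) is written with binary; the
-- compressed payload is appended as a raw lazy bytestring.
writeCompressed :: Binary t => FilePath -> t -> Word64 -> L.ByteString -> IO ()
writeCompressed path tree len payload =
  L.writeFile path (runPut (put tree >> put len) `L.append` payload)

-- Read the header back, then take the rest of the file as the payload.
readCompressed :: Binary t => FilePath -> IO (t, Word64, L.ByteString)
readCompressed path = do
  bs <- L.readFile path
  return $ flip runGet bs $ do
    tree <- get
    len  <- get
    rest <- getRemainingLazyByteString
    return (tree, len, rest)
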
3

The usual approach when (Huffman-)compressing, since one can't get around processing the input twice (once to collect the probability distribution, and once to do the actual compressing), is to chunk the input up into blocks and compress each one separately. While that still eats memory, it only eats, at most, a constant amount.
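
Roughly sketched (compressBlock below is a stand-in for a two-pass Huffman encoder applied to one strict block, and the block size is an arbitrary choice):

import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

blockSize :: Int64
blockSize = 1024 * 1024  -- 1 MiB blocks, an arbitrary choice

-- Split the lazy input into fixed-size strict blocks and compress each one
-- independently, so only one block needs to be resident at a time.
compressBlocks :: L.ByteString -> L.ByteString
compressBlocks = L.fromChunks . map compressBlock . blocks
  where
    blocks bs
      | L.null bs = []
      | otherwise = let (block, rest) = L.splitAt blockSize bs
                    in S.concat (L.toChunks block) : blocks rest

-- Stand-in for the real per-block, two-pass Huffman encoder.
compressBlock :: S.ByteString -> S.ByteString
compressBlock = undefined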

That said, you might want to have a look at bytestring-mmap, though that won't work with standard input, sockets, and other file descriptors that aren't backed by a file system which supports mmap.

You can also re-read the bytestring from the file (again, provided you're not receiving it from anything pipe-like) after collecting the probability distribution, but that will still make your code bail out on, say, 1 TB files.
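
Sketched out (countFrequencies and encodeWith are stand-ins for your actual functions, and the frequency table has to be forced before the second read):

import qualified Data.ByteString.Lazy as L

compressFileTwice :: FilePath -> FilePath -> IO ()
compressFileTwice inPath outPath = do
  -- First pass: read the file once, only to collect the distribution.
  -- ($!) forces only to weak head normal form; a deeper force may be needed
  -- depending on how the table is represented.
  freqs <- (return $!) . countFrequencies =<< L.readFile inPath
  -- Second pass: re-open the file and encode it with the finished codebook,
  -- so the chunks from the first read can be garbage-collected.
  input <- L.readFile inPath
  L.writeFile outPath (encodeWith freqs input)
  where
    countFrequencies = undefined -- stand-in for the frequency-counting pass
    encodeWith       = undefined -- stand-in for the encoder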

barsoap