
I am trying to implement Huffman coding in C#. I have a problem with encoding large files, as it takes too much time. For example, encoding an 11 MiB binary file takes 10 seconds in debug mode, and I did not even bother waiting for my program to finish with a 27 MiB file.

Here is the problematic loop:

    BitArray bits = new BitArray(8);   // accumulates up to 8 bits
    byte[] byteToWrite = new byte[1];
    byte bitsSet = 0;

    while ((bytesRead = inputStream.Read(buffer, 0, 4096)) > 0) // Read input in chunks
    {
        for (int i = 0; i < bytesRead; i++)
        {
            // Append the code bits for the current input byte
            for (int j = 0; j < nodesBitStream[buffer[i]].Count; j++)
            {
                if (bitsSet != 8)
                {
                    bits[bitsSet] = nodesBitStream[buffer[i]][j];
                    bitsSet++;
                }
                else
                {
                    // A full byte has accumulated: write it out and start over
                    bits.CopyTo(byteToWrite, 0);
                    outputStream.Write(byteToWrite, 0, byteToWrite.Length);
                    bits = new BitArray(8);
                    bitsSet = 0;

                    bits[bitsSet] = nodesBitStream[buffer[i]][j];
                    bitsSet++;
                }
            }
        }
    }

nodesBitStream is a Dictionary<byte, List<bool>>. Each List<bool> represents the path from the Huffman tree root to the leaf node containing a specific symbol (a byte).
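For illustration, an entry might look like this (the byte and bit values here are made up; the real codes depend on the tree that was built):

    // hypothetical code for byte 0x41: root -> left -> right -> left
    nodesBitStream[0x41] = new List<bool> { false, true, false };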

So I am accumulating bits to form a byte, which I then write to the encoded file. It is quite obvious that this can take a very long time, but I have not figured out another way yet. Therefore I am asking for advice on how to speed up the process.

Popa611
    How long did it take in Release build? – mjwills Nov 14 '18 at 21:09
  • It would be awesome if you could provide a [mcve]. – mjwills Nov 14 '18 at 21:10
  • just FYI, a 4K buffer is probably too small for an SSD, though it's probably not your bottleneck – TheGeneral Nov 14 '18 at 21:18
  • FWIW this belongs on code review – TheGeneral Nov 14 '18 at 21:32
  • @mjwills In Release build it took approximately 9 seconds. Also I am not quite sure about the example you'd want. The dictionary I use has 256 keys and each value (List) has around 8 bool values and I go through all of that. – Popa611 Nov 14 '18 at 21:35
  • @TheGeneral Yeah, I do not even have an SSD on this machine. But it's good to know. Also I did not realize that this could fit Code Review more, as I would like to know how to encode properly. I know my code is not good so it does not need reviewing in fact. – Popa611 Nov 14 '18 at 21:40
  • `I know my code is not good so it does not need reviewing in fact.` That means it **does** need reviewing. – mjwills Nov 14 '18 at 22:27

2 Answers


Working bit by bit is a lot of extra work. Also, while a Dictionary<byte, TVal> is decent, a plain array is even faster.

The Huffman codes can also be represented as a pair of integers, one for the length (in bits) and the other holding the bits. In this representation, you can process a symbol in a couple of fast operations, for example (not tested):

// Assumes: int[] symbols (the input bytes), int[] lengths (code length per
// symbol, in bits), uint[] values (the code bits, stored in the low bits)
BinaryWriter w = new BinaryWriter(outStream);
uint buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbols.Length; i++)
{
    int s = symbols[i];
    buffer <<= lengths[s];  // make room for the bits
    bufbits += lengths[s];  // buffer got longer
    buffer |= values[s];    // put in the bits corresponding to the symbol

    while (bufbits >= 8)    // as long as there is at least a byte in the buffer
    {
        bufbits -= 8;       // forget it's there
        w.Write((byte)(buffer >> bufbits)); // and save it
    }
}
if (bufbits != 0)
    w.Write((byte)(buffer << (8 - bufbits))); // flush the last partial byte, zero-padded

Or some variant: for example, you could fill the bytes the other way around, or save up bytes in an array and do bigger writes (as sketched below), etc.
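For instance, here is a sketch of the "bigger writes" variant, collecting output bytes in a local array and handing them to the stream in 4 KiB chunks (same assumed symbols/lengths/values arrays as above, plus a hypothetical outStream):

byte[] outBuf = new byte[4096];
int outPos = 0;
uint buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbols.Length; i++)
{
    int s = symbols[i];
    buffer = (buffer << lengths[s]) | values[s];
    bufbits += lengths[s];
    while (bufbits >= 8)
    {
        bufbits -= 8;
        outBuf[outPos++] = (byte)(buffer >> bufbits);
        if (outPos == outBuf.Length)          // local buffer full: one big write
        {
            outStream.Write(outBuf, 0, outPos);
            outPos = 0;
        }
    }
}
if (bufbits != 0)
    outBuf[outPos++] = (byte)(buffer << (8 - bufbits));
outStream.Write(outBuf, 0, outPos);           // flush whatever is left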

This code requires code lengths to be limited to 25 bits at most: the 32-bit buffer may already hold up to 7 pending bits when the next code is shifted in, so 7 + 25 = 32 is the largest code that still fits. Usually other requirements lower that limit even further; huge code lengths are not needed to get a good compression ratio.

harold

I don't really know how the algorithm works, but looking at your code, two things stick out:

  1. You seem to be using a dictionary indexed by a byte. A plain List<bool>[] is likely faster, using buffer[i] to index into it; the memory price is low. With an array you exchange hash lookups for simple offsets, which are faster, and you are doing quite a few lookups there.

  2. Why are you instantiating bits on every iteration? Depending on how many iterations you do, that can put pressure on the GC. There is no need: you overwrite every bit anyway and flush the result every 8 bits, so simply overwrite the same instance over and over instead of newing one up (see the sketch after this list).
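A minimal sketch combining both suggestions, reusing the names from the question (the codes lookup table and its setup are illustrative, not from the original code):

    // Build a 256-entry lookup table once, from the existing dictionary
    List<bool>[] codes = new List<bool>[256];
    foreach (var pair in nodesBitStream)
        codes[pair.Key] = pair.Value;

    BitArray bits = new BitArray(8);   // created once, reused throughout
    byte[] byteToWrite = new byte[1];
    int bitsSet = 0;

    while ((bytesRead = inputStream.Read(buffer, 0, 4096)) > 0)
    {
        for (int i = 0; i < bytesRead; i++)
        {
            List<bool> code = codes[buffer[i]];   // array offset, no hash lookup
            for (int j = 0; j < code.Count; j++)
            {
                bits[bitsSet++] = code[j];
                if (bitsSet == 8)                  // byte full: write it out
                {
                    bits.CopyTo(byteToWrite, 0);
                    outputStream.Write(byteToWrite, 0, 1);
                    bitsSet = 0;                   // overwrite the same instance
                }
            }
        }
    }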

InBetween
  • But I do not know how big the `List[]` would be. I need a dynamic array. So maybe List of lists? And with "newing" the BitArray you are right, thank you. – Popa611 Nov 14 '18 at 21:42
  • @ThePopa611 you do know, given that you're encoding bytes: 256 elements, exactly enough that every distinct byte can have an entry – harold Nov 14 '18 at 21:44