I'm attempting to write a compressor with Huffman coding. The process involves using `BitArray`s to store the values. All's fine and dandy until I load something slightly larger.

Currently I have the program load a 93 MB mp4 video. Part of the encoding process looks like this:

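    // Needs: using System.Collections; using System.Collections.Generic;
    // `source` is the input byte sequence; `dictionary` maps each byte to its Huffman code (a BitArray).
    var encodedSource = new List<bool>();   // staging buffer, one bool per bit
    var bitList = new List<BitArray>();     // packed chunks
    var listSize = 0;                       // total number of encoded bits
    foreach (var t in source)
    {
        var encodedSymbol = new bool[dictionary[t].Length];
        dictionary[t].CopyTo(encodedSymbol, 0);
        encodedSource.AddRange(encodedSymbol);
        // Pack the staged bits into a BitArray once the buffer passes 1,000,000 bits.
        if (encodedSource.Count > 1000000)
        {
            bitList.Add(new BitArray(encodedSource.ToArray()));
            listSize += encodedSource.Count;
            encodedSource = new List<bool>();
        }
    }
    // Flush the final partial chunk; without this the trailing bits are dropped.
    if (encodedSource.Count > 0)
    {
        bitList.Add(new BitArray(encodedSource.ToArray()));
        listSize += encodedSource.Count;
    }
    // Concatenate all chunks into one BitArray.
    var bits = new BitArray(listSize);
    var index = 0;
    foreach (var bitArray in bitList)
    {
        foreach (var b in bitArray)
        {
            bits[index++] = (bool) b;
        }
    }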

`encodedSource` and `bitList` seem to take far more space than they should need (combined, they take around 800 MB upon completion).

After the encoding is done, `bitList` is copied into `bits`, then into a byte array, and finally written to the file. `bits` seems to be a normal size, about 90 MB, and the resulting file with headers and so on, at 91 MB, is normal too. I can't figure out why `encodedSource` and `bitList` take so much space, or find a method that will save some space.

--- Explaining the code ---

I load the byte-to-code conversions into `dictionary` to speed up the lookup (time went from 5 minutes to 69 seconds). `bitList` exists because saving everything into `encodedSource` takes way too much space. Copying it into `bitList` takes about half the memory, which is still more than the 1/8th it should actually take, but less.

Edit: Didn't realize I hadn't actually put in a question. The question is: why does it take so much space, and what can I do to mitigate that?

Also, I have thought about simply writing directly into the file every X bits, but I haven't gotten around to that yet. I'd like to solve this problem before getting there, but I can do that if need be.

Knowledge Cube
EricChen1248
  • Your `encodedSource.Count` comparison with 1000000 strikes me as a ["magic number"](https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants). Where does this value come from? – Knowledge Cube Jun 01 '17 at 16:00
  • Simply a tradeoff between speed and size. Originally the bits were only converted to a `BitArray` upon completion, which resulted in a massive `encodedSource`. The first time I changed it to this, the threshold was set to 1000; `encodedSource` was relatively small, and since `BitArray`s are smaller, overall it was still better than before. But that made the program garbage collect every second and slowed it down a lot. So I set it to 1000000. `encodedSource` is larger as a result, but it runs much, much faster. – EricChen1248 Jun 01 '17 at 16:04
  • It doesn't answer your question, but you know you can avoid all of this in the first place, right? You can easily stream out your symbols to a densely packed `int[]` or `byte[]` without ever having a bit-per-byte representation or anything annoying like intermediate `BitArray`s. You can represent each symbol as a pair of `int`s, one with the bits and one with the length in number of bits, and write it out like [this](https://stackoverflow.com/a/28578543/555045) – harold Jun 01 '17 at 16:05
  • @harold It's kinda hard to do per byte, because the encodedSymbol varies in size. Sometimes it's 1 bit, sometimes it's 11 bits. Copying the buffer and shifting each time also adds a lot of work. I know I could also stream it directly to a file and take it out of memory completely. – EricChen1248 Jun 01 '17 at 16:11
  • @EricChen1248 oh it's not so bad really. You can enqueue the bits in an `int` or `long` until you've saved up a nice multiple of 8. The only shifting happens on that `int`/`long` so it's no big deal. See the link from my previous comment to get the idea – harold Jun 01 '17 at 16:30
  • @harold oh I see. Yes, dealing with the shift with a queue would be much easier. I'll check it out tomorrow, it's bedtime for me here. Hopefully someone can still answer why it takes up so much space. – EricChen1248 Jun 01 '17 at 16:35
  • @EricChen1248 well not a `queue` .. you could use that too but the point is that the `int` *is* the queue (of bits), so you can enqueue an entire symbol in one go and then dequeue whole bytes. – harold Jun 01 '17 at 16:38
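
A minimal sketch of the bit-queue idea from the comments above, for illustration only. It assumes each Huffman code is stored as a `(uint code, int length)` pair of at most 32 bits rather than as a `BitArray`, and that bits are packed most significant bit first. `BitWriter`, `Write`, and `Flush` are hypothetical names, not taken from the post or the linked answer. The pending bits live in a single `ulong`, and whole bytes are written out as soon as eight or more bits are queued, so no bit-per-`bool` buffer is ever built.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the original post): packs variable-length
// codes into bytes, most significant bit first.
public sealed class BitWriter
{
    private readonly Stream _output;
    private ulong _buffer; // pending bits sit in the low _count bits
    private int _count;    // number of pending bits, always < 8 between calls

    public BitWriter(Stream output)
    {
        _output = output ?? throw new ArgumentNullException(nameof(output));
    }

    // Enqueues the low `length` bits of `code` (assumed 1 <= length <= 32).
    public void Write(uint code, int length)
    {
        if (length < 1 || length > 32)
            throw new ArgumentOutOfRangeException(nameof(length));

        ulong mask = (1UL << length) - 1;
        _buffer = (_buffer << length) | (code & mask);
        _count += length;

        // Dequeue whole bytes from the top of the pending bits.
        while (_count >= 8)
        {
            _count -= 8;
            _output.WriteByte((byte)((_buffer >> _count) & 0xFF));
        }
    }

    // Pads the last partial byte with zero bits and writes it.
    public void Flush()
    {
        if (_count > 0)
        {
            _output.WriteByte((byte)((_buffer << (8 - _count)) & 0xFF));
            _count = 0;
        }
    }
}
```

With a `FileStream` (optionally wrapped in a `BufferedStream`) as the output, the encoding loop would reduce to one `Write` call per source byte plus a final `Flush`, which also covers the "write directly into the file" idea from the question. A real format would still need to record the total bit count or use an end-of-stream symbol so the decoder can ignore the padding bits.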

0 Answers