The setup

Say I've got:

  • A series of numbers resulting from LZW compression of a bitmap:

    256 1 258 258 0 261 261 259 260 262 0 264 1 266 267 258 2 273 2 262 259 274 275 270 278 259 262 281 265 276 264 270 268 288 264 257
    
  • An LZW-compressed, variable-length-encoded bytestream (including the LZW code size header and sub-block markers) which represents this same series of numbers:

    00001000 00101001 00000000 00000011 00001000 00010100 00001000 10100000 
    01100000 11000001 10000001 00000100 00001101 00000010 01000000 00011000 
    01000000 11100001 01000010 10000001 00000010 00100010 00001010 00110000 
    00111000 01010000 11100010 01000100 10000111 00010110 00000111 00011010 
    11001100 10011000 10010000 00100010 01000010 10000111 00001100 01000001 
    00100010 00001100 00001000 00000000
    
  • And an initial code width of 8.
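
For reference, here is a minimal Python sketch of stripping that framing before any bit unpacking happens. It assumes the stream is laid out as one LZW minimum code size byte, then length-prefixed sub-blocks, then a zero-length terminator. The names `STREAM` and `split_gif_image_data` are made up for illustration.

```python
# The bytestream from the setup, written as the same bit strings.
BITS = """
00001000 00101001 00000000 00000011 00001000 00010100 00001000 10100000
01100000 11000001 10000001 00000100 00001101 00000010 01000000 00011000
01000000 11100001 01000010 10000001 00000010 00100010 00001010 00110000
00111000 01010000 11100010 01000100 10000111 00010110 00000111 00011010
11001100 10011000 10010000 00100010 01000010 10000111 00001100 01000001
00100010 00001100 00001000 00000000
"""
STREAM = bytes(int(b, 2) for b in BITS.split())


def split_gif_image_data(stream):
    """Return (min_code_size, payload) with the sub-block markers removed."""
    min_code_size = stream[0]          # first byte: LZW minimum code size
    payload = bytearray()
    pos = 1
    while True:
        block_len = stream[pos]        # each sub-block starts with its length
        pos += 1
        if block_len == 0:             # a zero-length block terminates the data
            break
        payload += stream[pos:pos + block_len]
        pos += block_len
    return min_code_size, bytes(payload)


min_code_size, payload = split_gif_image_data(STREAM)
print(min_code_size, len(payload))
```

For the bytes above this reports a minimum code size of 8 and a single 41-byte sub-block; only those 41 bytes carry packed codes.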

The problem

I'm trying to derive the initial series of numbers (the integer array) from the bytestream.

From what I've read, the procedure is to take the initial code width and scan right-to-left, reading (initial code width + 1) bits at a time, to extract the integers from the bytestream. For example, with an initial code width of 2 (3 bits per code):

iteration #1:   1001011011100/001/    yield return 4
iteration #2:   1001011011/100/001    yield return 1
iteration #3:   1001011/011/100001    yield return 6
iteration #4:   1001/011/011100001    yield return 6

This procedure will not work for iteration #5, which will yield 1:

iteration #5:   1/001/011011100001    yield return 1 (expected 9)

The code width should have been increased by one.

The question

How am I supposed to know when to increase the code width when reading the variable-length-encoded bytestream? Do I have all of the required information necessary to decompress this bytestream? Am I conceptually missing something?

UPDATE:

After a long discussion with greybeard, I found out that I was reading the binary string incorrectly: `00000000 00000011 00` is to be interpreted as 256, 1. The bytestream is not read as big-endian; codes are packed starting from the least significant bit of each byte.

And very roughly speaking, if you are decoding a bytestream, you increase the number of bits read per code as soon as the next dictionary code to be assigned reaches 2^N, where N is the current code width in bits.
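
To make that concrete, here is a hedged Python sketch of the unpacking step. It assumes the usual GIF conventions: codes are packed least-significant-bit first, the first new dictionary code is End Of Information + 1, no entry is added for the first code after a Clear Code, and the width is capped at 12 bits. The function name `unpack_gif_codes` is my own, and `PAYLOAD` is the byte sequence from the question with the code size byte and sub-block markers already removed.

```python
PAYLOAD = bytes.fromhex(
    "00 03 08 14 08 a0 60 c1 81 04 0d 02 40 18 40 e1 42 81 02 22 0a"
    " 30 38 50 e2 44 87 16 07 1a cc 98 90 22 42 87 0c 41 22 0c 08"
)


def unpack_gif_codes(payload, min_code_size):
    """Unpack LSB-first variable-width codes, tracking only dictionary size."""
    clear_code = 1 << min_code_size
    end_code = clear_code + 1
    width = min_code_size + 1
    next_code = end_code + 1           # next dictionary code to be assigned
    first_after_clear = True

    codes = []
    bit_buffer, bit_count, pos = 0, 0, 0
    while True:
        # Refill: each new byte goes above the bits already buffered.
        while bit_count < width and pos < len(payload):
            bit_buffer |= payload[pos] << bit_count
            bit_count += 8
            pos += 1
        if bit_count < width:
            break                      # ran out of data
        code = bit_buffer & ((1 << width) - 1)
        bit_buffer >>= width
        bit_count -= width
        codes.append(code)

        if code == end_code:
            break
        if code == clear_code:         # dictionary reset
            width = min_code_size + 1
            next_code = end_code + 1
            first_after_clear = True
            continue
        if first_after_clear:          # nothing to combine with yet
            first_after_clear = False
            continue
        if next_code < 4096:           # every other code adds one entry
            next_code += 1
            if next_code == (1 << width) and width < 12:
                width += 1
    return codes


print(unpack_gif_codes(PAYLOAD, 8))
```

On the 41 payload bytes this reproduces the series from the setup, from 256 through 257. That stream never assigns code 512, so all 36 codes happen to be 9 bits wide; the width bump only shows up on longer inputs or smaller initial code widths.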

alex
  • FYI - this is not my homework. – alex Dec 21 '16 at 16:41
  • How does `an initial code width of 2` work if the alphabet is decimal digits? Or is it { 4, 1, 6, 9 }? (Why do you process groups of 3 bits?) – greybeard Dec 21 '16 at 19:22
  • @greybeard The initial code width is 2 because the fully decompressed result is an array with integers 0 through 3. `[ 4, 1, 6, 6, 9...]` represents the LZW-encoded series prior to packing the values into a variable-length encoded stream. – alex Dec 22 '16 at 02:32
  • Please _do_ include the "plaintext" with your example. Depending on implementation/convention details, the very first code may be short enough to just encode one source symbol. _Every time_ a code is output while the dictionary isn't full, another code is assigned: the number of possible output codes increases. If these are represented by an integral number of symbols from a fixed alphabet (e.g., {0, 1} for binary), that number may need to increase. The decoder keeps track of the dictionary, _including the number of possible codes_: it will know _3 bits_ up to the `6` and _four bits_ from `9`. – greybeard Dec 22 '16 at 08:53
  • (Have a look at the "related" column, esp. [How to detect codeword length for LZW Decoding](http://stackoverflow.com/questions/35755758/how-to-detect-codeword-length-for-lzw-decoding?rq=1).) – greybeard Dec 22 '16 at 08:56
  • @greybeard I'm following this tutorial [here](http://giflib.sourceforge.net/whatsinagif/lzw_image_data.html). To be clear, my LZW implementation works correctly. My problem is with packing the bytes. I know how to pack them, but I don't know how to unpack them. – alex Dec 23 '16 at 15:43
  • (`my LZW implementation works correctly` - how do you know?) – greybeard Dec 23 '16 at 20:33
  • @greybeard Because I went from 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, ... to #4 #1 #6 #6 #2 #9 #9 #7 #8 #10 #2 #12 #1 #14 #15 #6 #0 #21 #0 #10 #7 #22 #23 #18 #26 #7 #10 #29 #13 #24 #12 #18 #16 #36 #12 #5 and back – alex Dec 24 '16 at 20:00
  • ` I went from 1, 1, 1, 1, 1, 2, 2, … to #4 #1 #6 #6 #2 #9 #9 #7…` Better than nothing. Why does the question present `4, 1, 6, 6, 9`? – greybeard Dec 26 '16 at 00:10
  • @greybeard I am trying to go from `#4 #1 #6 #6 #2 #9 #9 #7 #8 #10 #2 #12 #1 #14 #15 #6 #0 #21 #0 #10 #7 #22 #23 #18 #26 #7 #10 #29 #13 #24 #12 #18 #16 #36 #12 #5` to `0x00, 0x03, 0x08, 0x14, 0x08, 0xa0, 0x60, 0xc1, 0x81, 0x04, 0x0d, 0x02, 0x40, 0x18, 0x40, 0xe1, 0x42, 0x81, 0x02, 0x22, 0x0a, 0x30, 0x38, 0x50, 0xe2, 0x44, 0x87, 0x16, 0x07, 0x1a, 0xcc, 0x98, 0x90, 0x22, 0x42, 0x87, 0x0c, 0x41, 0x22, 0x0c, 0x08` – alex Dec 27 '16 at 01:51
  • @greybeard To be more clear - I am attempting to implement the GIF variation of the LZW algorithm (LZW GIF). This variant involves taking an LZW-compressed data stream and further compressing it using variable-length encoding after LZW encoding has occurred. I took my input and encoded it - it worked. My question is not about LZW encoding step (which works), it's about variable-length decoding, as the title states. – alex Dec 27 '16 at 02:02
  • @greybeard I'm basically just reading a binary string and looking for a 4 and then a 1 and then a 6, etc. – alex Dec 27 '16 at 02:05
  • (I don't need a claim stated more than once. I think it easier to argue if input didn't switch between `1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, …` and `1, 1, 1, 2, 2, 2, 0,…`, output between `4, 1, 6, 6, 9,…` and `4, 1, 6, 6, 2, 9,…`: clean up the question.) – greybeard Dec 27 '16 at 02:07
  • `My question is not about LZW encoding step (which works), it's about variable-length decoding, as the title states` - why not follow the suggestion in my comment to my answer? _When_ do you assign code `8`? – greybeard Dec 27 '16 at 02:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/131605/discussion-between-alex-and-greybeard). – alex Dec 27 '16 at 15:12

1 Answer

Decompressing, you are supposed to build a dictionary in much the same way as the compressor. You know you need to increase the code width as soon as the compressor might use a code too wide for the current width.

As long as the dictionary is not full (the maximum code is not assigned), a new code is assigned for every (regular) code put out (not the Clear Code or End Of Information codes).

With the example in the presentation you linked, 8 is assigned when the second 6 is "transmitted" - you need to switch to four bits before reading the next code.

(This is where the example and your series of numbers differ - the link presents 4, 1, 6, 6, 2, 9.)
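
As an illustration of that bookkeeping, here is a small Python sketch that computes how many bits each code is read with, using nothing but the dictionary size. It assumes the GIF conventions (first new code is End Of Information + 1, no entry for the first code after a Clear Code, 12-bit maximum); the name `code_widths` is hypothetical.

```python
def code_widths(codes, min_code_size):
    """Return the bit width at which a decoder would read each code."""
    clear_code = 1 << min_code_size
    end_code = clear_code + 1
    width = min_code_size + 1
    next_code = end_code + 1           # next dictionary code to be assigned
    first_after_clear = True

    widths = []
    for code in codes:
        widths.append(width)           # width in effect when this code is read
        if code == end_code:
            break
        if code == clear_code:
            width = min_code_size + 1
            next_code = end_code + 1
            first_after_clear = True
            continue
        if first_after_clear:          # first code after a clear adds no entry
            first_after_clear = False
            continue
        if next_code < 4096:
            next_code += 1             # one new entry per regular code
            if next_code == (1 << width) and width < 12:
                width += 1             # the next code no longer fits
    return widths


print(code_widths([4, 1, 6, 6, 2, 9], 2))  # [3, 3, 3, 3, 4, 4]
```

With an initial code width of 2, the decoder has assigned 6 and 7 by the time it has processed the second `6`, so the next code to assign is 8 and the following `2` is read as four bits.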

alex
greybeard
  • Just to clarify, there's no *might* in the sense of there being a choice as to when to switch to a larger width. – Alexey Frunze Dec 24 '16 at 01:19
  • This doesn't sound relevant to my problem. My LZW encoder and LZW decoders both work correctly, as I have mentioned numerous times above. To be crystal clear: I took an index stream `[1, 1, 1, 2, 2, 2, 0, 0, ... ]` and used my LZW encoder to convert it to a code stream `[4, 1, 6, 6, 9, ... ]`. I can also use my LZW decoder to convert the code stream back to an index stream. My issue is with packing the code stream into a new byte array, and unpacking a byte array into the code stream. – alex Dec 24 '16 at 19:59
  • Attitude aside: the second `6` (lowest/first additional code) looks wrong for any input sequence _not_ starting with exactly 5 identical symbols. `2` _not_ appearing looks wrong: for a given input sequence, try to present a table similar to the one with columns labelled `Step | Action | Index | Stream | New Code Table Row | Code Stream` from the link: in your question. `I can also use my LZW decoder to convert the code stream back to an index stream.` please demonstrate that on `4, 1, 6, 6, 9`. – greybeard Dec 24 '16 at 21:42