
I'm writing a general-purpose LZW decoder in C++ and I'm having trouble finding documentation on the length (in bits) of the codewords used. Some articles I've found say that codewords are 12 bits long, while others say 16 bits, and still others say that a variable bit length is used. So which is it? It would make sense to me that the bit length is variable, since that would give the best compression (i.e. initially start with 9 bits, then move to 10 when necessary, then 11, and so on). But I can't find any "official" documentation on what the industry standard is.

For example, suppose I open up Microsoft Paint, create a simple 100x100 pixel all-black image, and save it as a TIFF. The image is stored in the TIFF using LZW compression. In this scenario, when I'm parsing the LZW codewords, should I read in 9, 12, or 16 bits for the first codeword? And how would I know which to use?

Thanks for any help you can provide.

Stanton
  • LZW by itself is only an algorithm; you won't find an industry standard for the specific parameters. For that you need to look deeper into each application; in your example, the TIFF spec. – Mark Ransom Mar 02 '16 at 19:17
  • Thanks Mark for having me search for the TIFF implementation of LZW. I just thought it was a generic algorithm... Turns out they have a few reserved entries in the dictionary (such as a Clear code and an EndOfInformation code), as well as bit-order information that I had wrong. – Stanton Mar 02 '16 at 19:51

1 Answer


LZW can be done any of these ways. By far the most common (at least in my experience) is to start with 9-bit codes, then, when the dictionary outgrows the current code size, move to 10-bit codes, and so on up to some maximum size.
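
To make that concrete, here is a minimal sketch of reading variable-width codes, assuming MSB-first bit packing (which is what TIFF uses; GIF packs bits the other way around). The class name and interface are purely illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal MSB-first bit reader (TIFF packs LZW codes most significant
// bit first).  Name and interface are just for illustration.
class BitReader {
public:
    explicit BitReader(const std::vector<uint8_t>& data) : data_(data) {}

    // Read the next `width`-bit code (9..12 in the TIFF case).
    // Assumes the caller stops at EndOfInformation, so we never run
    // past the end of the buffer here.
    uint32_t readCode(int width) {
        while (bitCount_ < width) {
            buffer_ = (buffer_ << 8) | data_[pos_++];
            bitCount_ += 8;
        }
        bitCount_ -= width;
        return (buffer_ >> bitCount_) & ((1u << width) - 1);
    }

private:
    const std::vector<uint8_t>& data_;
    std::size_t pos_ = 0;
    uint32_t buffer_ = 0;
    int bitCount_ = 0;
};
```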

From there, you typically have a couple of choices. One is to clear the dictionary and start over. Another is to continue using the current dictionary, without adding new entries. In the latter case, you typically track the compression rate, and if it drops too far, then you clear the dictionary and start over.
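
Here is a rough sketch of how an encoder might implement that second strategy. This is encoder-side logic (a decoder just mirrors whichever choice the encoder made), and the counters and the bits-per-byte threshold are my own illustrative assumptions, not taken from any spec:

```cpp
#include <cstddef>

// Sketch of the "keep the full dictionary, but watch the compression
// ratio and reset if it degrades" strategy described above.
struct CompressionMonitor {
    std::size_t inputBytes = 0;   // raw bytes consumed since the last reset
    std::size_t outputBits = 0;   // code bits emitted since the last reset

    bool shouldClearDictionary() const {
        if (inputBytes < 1024) return false;           // wait for a sample
        double bitsPerByte = double(outputBits) / double(inputBytes);
        return bitsPerByte > 7.0;                       // barely compressing
    }
};
```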

I'd have to dig through docs to be sure, but if I'm not mistaken, the specific implementation of LZW used in TIFF starts at 9 and goes up to 12 bits (when it was being designed, MS-DOS was a major target, and the dictionary for 12-bit codes used most of the available 640K of RAM). If memory serves, it clears the table as soon as the last 12-bit code has been used.
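
For illustration, here is roughly what the decode loop looks like under those assumptions, building on the `BitReader` sketched above. It follows the Appendix F conventions as I understand them (codes 0–255 are literals, 256 is Clear, 257 is EndOfInformation, width grows from 9 to 12 bits), but the exact point at which the width is bumped (TIFF's "early change", one entry before the table is full) is an assumption you should verify against the spec:

```cpp
#include <cstdint>
#include <vector>

std::vector<uint8_t> lzwDecodeTiff(BitReader& in) {
    constexpr uint32_t kClear = 256, kEoi = 257, kFirstFree = 258;
    std::vector<std::vector<uint8_t>> table;
    std::vector<uint8_t> out;
    int width = 9;
    uint32_t prev = kClear;

    auto resetTable = [&] {
        table.assign(kFirstFree, {});
        for (uint32_t i = 0; i < 256; ++i) table[i] = {uint8_t(i)};
        width = 9;
    };
    resetTable();

    for (;;) {
        uint32_t code = in.readCode(width);
        if (code == kEoi) break;
        if (code == kClear) { resetTable(); prev = kClear; continue; }

        if (prev == kClear) {
            // First real code after a Clear is always a literal.
            out.insert(out.end(), table[code].begin(), table[code].end());
        } else {
            std::vector<uint8_t> entry;
            if (code < table.size()) {
                entry = table[code];
            } else {
                // The classic KwKwK case: code not in the table yet.
                entry = table[prev];
                entry.push_back(table[prev][0]);
            }
            out.insert(out.end(), entry.begin(), entry.end());

            std::vector<uint8_t> newEntry = table[prev];
            newEntry.push_back(entry[0]);
            table.push_back(newEntry);

            // Assumed early-change rule: widen one code before the limit.
            if (table.size() + 1 >= (1u << width) && width < 12) ++width;
        }
        prev = code;
    }
    return out;
}
```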

Jerry Coffin
  • Thanks Jerry. So the question is: which codeword length is correct for my example? In the case of the all-black image (the first pixel is RGB 0,0,0), I would expect the first codeword (if 9 bits) to contain all zeros, but in my case it is `000000010`, which evaluates to 128, not what I was expecting. – Stanton Mar 02 '16 at 19:17
  • @Stanton: You'll need to read the TIFF spec for details. I'm pretty sure it starts at 9 bits, but it's pretty routine to pre-fill the dictionary with 7-bit or 8-bit values. A fair number also start by sending a "clear dictionary" code (and if memory serves, TIFF did) in which case the first word won't normally carry any actual data. – Jerry Coffin Mar 02 '16 at 19:31
  • Thanks for the info. I found the TIFF spec in Appendix F of http://www.martinreddy.net/gfx/2d/TIFF-5.txt. They use 9-bit codes to start (like you suggested), and there are two reserved codes for Clear and EndOfInfo. Also, I had the bit order wrong. So I should have gotten 256 as the first code, which evaluates to their Clear code (a quick check of the bit order is sketched below). – Stanton Mar 02 '16 at 19:55
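
For reference, a quick check of that bit-order point: read MSB-first, the first nine bits of a TIFF LZW strip should come out as 256, the Clear code. The strip bytes below are made up purely to show the arithmetic:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint8_t strip[] = {0x80, 0x00, 0x20};   // hypothetical start of a strip
    // First code = top nine bits of the stream, read MSB-first.
    uint32_t firstCode = (uint32_t(strip[0]) << 1) | (strip[1] >> 7);
    std::printf("first code = %u\n", firstCode);  // prints 256 (Clear)
    return 0;
}
```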