Decoding TIFF LZW codes not yet in the dictionary

Question

I wrote a decoder for LZW-compressed TIFF images, and nearly all of it works: it decodes large images at various bit depths, with or without horizontal prediction. It handles files written by most programs (such as Photoshop and Krita, with various encoding options) fine, with one exception. Files created by ImageMagick's convert appear to contain LZW codes that aren't in the dictionary yet, and I don't know how to handle them.

Most of the time, the 9- to 12-bit code that isn't yet in the dictionary is the next one my decoding algorithm is about to add (which I don't think should be a problem in itself, although my algorithm fails on an image that contains such cases). At times, though, the code is hundreds of entries ahead. In one case the first code after the clear code (256) is 364, which seems impossible given that the clear code empties my dictionary of all codes 258 and above. In another case the code is 501 when my dictionary only goes up to 317!

I have no idea how to deal with this, yet I seem to be the only one with the problem: the decoders in other programs load such images fine. So how do they do it?

Here's the core of my decoding algorithm. Given how much code is involved, I can't provide a complete compilable example in compact form, but since this is a matter of algorithmic logic, this should be enough. It closely follows the algorithm described in the official TIFF specification (page 61); in fact, most of the spec's pseudocode appears in the comments.

void tiff_lzw_decode(uint8_t *coded, buffer_t *dec)
{
    buffer_t word={0}, outstring={0};
    size_t coded_pos;   // position in bits
    int i, new_index, code, maxcode, bpc;

    buffer_t *dict={0};
    size_t dict_as=0;

    bpc = 9;            // starts with 9 bits per code, increases later
    tiff_lzw_calc_maxcode(bpc, &maxcode);
    new_index = 258;        // index at which new dict entries begin
    coded_pos = 0;          // bit position

    lzw_dict_init(&dict, &dict_as);

    while ((code = get_bits_in_stream(coded, coded_pos, bpc)) != 257)   // while ((Code = GetNextCode()) != EoiCode) 
    {
        coded_pos += bpc;

        if (code >= new_index)
            printf("Out of range code %d (new_index %d)\n", code, new_index);

        if (code == 256)                        // if (Code == ClearCode)
        {
            lzw_dict_init(&dict, &dict_as);             // InitializeTable();
            bpc = 9;
            tiff_lzw_calc_maxcode(bpc, &maxcode);
            new_index = 258;

            code = get_bits_in_stream(coded, coded_pos, bpc);   // Code = GetNextCode();
            coded_pos += bpc;

            if (code == 257)                    // if (Code == EoiCode)
                break;

            append_buf(dec, &dict[code]);               // WriteString(StringFromCode(Code));

            clear_buf(&word);
            append_buf(&word, &dict[code]);             // OldCode = Code;
        }
        else if (code < 4096)
        {
            if (dict[code].len)                 // if (IsInTable(Code))
            {
                append_buf(dec, &dict[code]);           // WriteString(StringFromCode(Code));

                lzw_add_to_dict(&dict, &dict_as, new_index, 0, word.buf, word.len, &bpc);
                lzw_add_to_dict(&dict, &dict_as, new_index, 1, dict[code].buf, 1, &bpc);    // AddStringToTable
                new_index++;
                tiff_lzw_calc_bpc(new_index, &bpc, &maxcode);

                clear_buf(&word);
                append_buf(&word, &dict[code]);         // OldCode = Code;
            }
            else
            {
                clear_buf(&outstring);
                append_buf(&outstring, &word);
                bufwrite(&outstring, word.buf, 1);      // OutString = StringFromCode(OldCode) + FirstChar(StringFromCode(OldCode));

                append_buf(dec, &outstring);            // WriteString(OutString);

                lzw_add_to_dict(&dict, &dict_as, new_index, 0, outstring.buf, outstring.len, &bpc); // AddStringToTable
                new_index++;
                tiff_lzw_calc_bpc(new_index, &bpc, &maxcode);

                clear_buf(&word);
                append_buf(&word, &dict[code]);         // OldCode = Code;
            }
        }

    }

    free_buf(&word);
    free_buf(&outstring);
    for (i=0; i < dict_as; i++)
        free_buf(&dict[i]);
    free(dict);
}

Judging by the output my code produces in these situations, only those few codes are decoded badly; everything before and after them decodes properly. In most cases, though, the rest of the image after one of these mystery future codes is ruined because the remaining decoded bytes are shifted by a few places. That means my reading of the 9- to 12-bit code stream is correct, so I really am seeing a 364 code right after a 256 dictionary-clearing code.

Edit: Here's an example file that contains such weird codes. I've also found a small TIFF LZW loading library that suffers from the same problem: it crashes where my loader finds the first weird code in this image (code 3073 when the dictionary only goes up to 2051). Since it's a small library, you can test it with the following code:

#include "loadtiff.h"
#include "loadtiff.c"
void loadtiff_test(char *path)
{
    int width, height, format;
    floadtiff(fopen(path, "rb"), &width, &height, &format);
}

And if anyone insists on diving into my code (which should be unnecessary, and it's a big library) here's where to start.

This is probably a stupid question, but I guess you've read how libtiff does it? This seems to be about the right spot: https://gitlab.com/libtiff/libtiff/blob/master/libtiff/tif_lzw.c#L442 — jcupitt, Apr 14 '19 at 12:49
@cgohlke I tried to make it quite clear in the question that it can't possibly be an issue with tracking code size. It's clear that it's a theory issue, ImageMagick clearly thinks it's possible to have such high codes and most decoders seem to know how to handle it fine. — Michel Rouzic, Apr 15 '19 at 05:12
@jcupitt I did, although I try to avoid trying to draw conclusions based on code that doesn't strive for clarity. But it seems that at least in the case of following a clear code libtiff would consider a following code above 255 to be corrupt, which is also what my code does but this doesn't help. — Michel Rouzic, Apr 15 '19 at 05:18
@MichelRouzic That file has a second LZW decoder below for an older version of the codec standard, perhaps that's the one that IM targets? — jcupitt, Apr 15 '19 at 09:13
Good news, I found another small library written by someone else that suffers from the same problem (see the edited question). I also posted some code to test that library and an example file that makes it crash. — Michel Rouzic, Apr 15 '19 at 13:40
@jcupitt It seems to do the same thing if a code above 256 follows a code 256, which is to claim that the LZW table is corrupted. Maybe I should try and test libtiff for myself, because by the looks of it, it should choke on those weird files too. — Michel Rouzic, Apr 15 '19 at 15:10
@MichelRouzic I tried your example file with my library (which uses libtiff for TIFF load) and it works fine for me. ImageMagick also uses libtiff for TIFF write, so I think the secret must be in there. I would try `convert strange.tif x.png` in a debugger and then watch the execution of `LZWDecode` (or whatever ends up being called). — jcupitt, Apr 15 '19 at 15:42
@cgohlke Good find, that does sound quite similar! However, I'm not sure I understand the solution: he talks about what to do when the table is full, but I just checked and my table is never full, `new_index` (the next index to write in the dictionary/table) doesn't even reach 4000. Not to mention I've had unexpectedly large codes right after the clear code. — Michel Rouzic, Apr 15 '19 at 17:00
Re "In one case the first code after the clear code (256) is 364": I ran the strips in the `imagemagick lzw 8.tif` file through two different LZW decoders (not libtiff) and there is no such case. One strip seems to be missing EOI. — cgohlke, Apr 16 '19 at 07:14
Good point although technically since the table entry at 256 is always empty then nothing gets actually written. As for the code 364 I was talking about a different file that I didn't provide. — Michel Rouzic, Apr 16 '19 at 12:38
I debugged ImageMagick's convert.exe as @jcupitt suggested, modified it so that it prints out the codes of strip #381 of the posted file, and you're right @cgohlke: in libtiff the decoding stops (without an EOI code) 6 codes before my weird codes appear. That's what `while (occ > 0)` does in the libtiff implementation, unlike the TIFF spec's pseudocode, which is exactly what I implemented and which makes no mention of strips ending without an EOI. Thank you both! I don't like to blame myself, so I'll blame the TIFF spec for leaving this out and libtiff's lack of comments. @cgohlke do you want to write the answer? — Michel Rouzic, Apr 16 '19 at 14:59
Yep, in the spec's decoding pseudocode (which I based my own code on, and clearly so did tiffloader's author) the only thing that can stop decoding is an EOI code. But libTIFF's encoder doesn't see it that way, so that effectively makes it an undocumented format feature. — Michel Rouzic, Apr 16 '19 at 15:22
Of course, but what I'm saying is the TIFF specification in Section 13 fails to mention that. Or did I miss it? Anyway, would you like to write the answer to this question? Otherwise I'll do it. — Michel Rouzic, Apr 16 '19 at 17:20
Will do. In hindsight it should indeed be obvious, but I thought about it and couldn't think of how to avoid reading too much (I was thinking about how to limit at the input, I didn't think of counting the output size), so I didn't do it. But again the spec's pseudo code is explicit enough that you'd think it would include such a consideration, but oh well. At least it's clear to me now. — Michel Rouzic, Apr 16 '19 at 18:59

score 2 · Accepted Answer · answered Apr 16 '19 at 20:37

The bogus codes come from decoding more than we're supposed to. An LZW strip does not always end with an End-of-Information code (257), so the decoding loop also has to stop once the expected number of decoded bytes has been output. That number of bytes per strip follows from the TIFF tags: ROWSPERSTRIP * IMAGEWIDTH * BITSPERSAMPLE / 8, multiplied by SAMPLESPERPIXEL if PLANARCONFIG is 1 (which means interleaved channels, as opposed to separate planes). So in addition to stopping when code 257 is encountered, the decoding loop must stop once that count of decoded bytes has been reached.
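To make this concrete, here is a minimal sketch of a decoder whose loop is bounded by the output size. It is not the code from the question or from libtiff. The names `strip_decoded_size`, `read_code` and `lzw_decode_bounded` are hypothetical, as is the `in_len` parameter. The sketch assumes rows are padded to whole bytes, that `rows` is the number of rows actually stored in the strip (the last strip can hold fewer than ROWSPERSTRIP), and it treats a code beyond the next free table index as corruption and stops.

```c
#include <stddef.h>
#include <stdint.h>

enum { LZW_CLEAR = 256, LZW_EOI = 257, LZW_FIRST = 258, LZW_MAX = 4096 };

/* Decoded size of one strip in bytes. `rows` is the number of rows actually
   stored in this strip. Assumes each row is padded to a whole byte. */
size_t strip_decoded_size(uint32_t width, uint32_t rows, unsigned bits_per_sample,
                          unsigned samples_per_pixel, int planar_config)
{
    /* PlanarConfiguration 1 interleaves all samples of a pixel in one strip */
    size_t samples = (planar_config == 1) ? samples_per_pixel : 1;
    size_t row_bytes = ((size_t)width * samples * bits_per_sample + 7) / 8;
    return row_bytes * rows;
}

/* Read one MSB-first code of `width` bits at bit offset `pos`.
   Returns -1 if the code would extend past the end of the input. */
static int read_code(const uint8_t *in, size_t in_len, size_t pos, int width)
{
    int code = 0;
    if (pos + (size_t)width > in_len * 8)
        return -1;
    for (int i = 0; i < width; i++, pos++)
        code = (code << 1) | ((in[pos >> 3] >> (7 - (pos & 7))) & 1);
    return code;
}

/* Decode at most out_len bytes; returns the number of bytes written. */
size_t lzw_decode_bounded(const uint8_t *in, size_t in_len,
                          uint8_t *out, size_t out_len)
{
    uint16_t prefix[LZW_MAX], length[LZW_MAX];  /* entry = string(prefix) + suffix */
    uint8_t suffix[LZW_MAX], first[LZW_MAX];    /* last and first byte of each entry */
    size_t pos = 0, n = 0;
    int width = 9, next = LZW_FIRST, old = -1;

    for (int i = 0; i < 256; i++) {
        prefix[i] = 0;
        length[i] = 1;
        suffix[i] = first[i] = (uint8_t)i;
    }

    while (n < out_len) {               /* stop once the strip is complete */
        int code = read_code(in, in_len, pos, width);
        if (code < 0 || code == LZW_EOI)
            break;
        pos += (size_t)width;
        if (code == LZW_CLEAR) {
            width = 9;
            next = LZW_FIRST;
            old = -1;
            continue;
        }
        if (old < 0) {
            if (code > 255)             /* first code after a clear must be a literal */
                break;
        } else {
            if (code > next)            /* beyond the next free entry: corrupt */
                break;
            if (next < LZW_MAX) {
                /* New entry: string(old) + first byte of string(code). When
                   code == next, that byte is the first byte of string(old). */
                prefix[next] = (uint16_t)old;
                suffix[next] = (code < next) ? first[code] : first[old];
                first[next] = first[old];
                length[next] = (uint16_t)(length[old] + 1);
                next++;
                if (next + 1 == (1 << width) && width < 12)
                    width++;            /* TIFF widens codes one entry early */
            }
        }
        /* Write string(code) back to front, dropping bytes past out_len */
        size_t len = length[code];
        int c = code;
        for (size_t i = len; i-- > 0; c = prefix[c])
            if (n + i < out_len)
                out[n + i] = suffix[c];
        n = (n + len > out_len) ? out_len : n + len;
        old = code;
    }
    return n;
}
```

A caller would size the output buffer with `strip_decoded_size(...)` and pass that value as `out_len`, so whatever follows the real data in a strip without an EOI code is never read as a code.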
