2

I implemented the following two functions for RLE compression of binary files.

char* RLEcompress(char* data, size_t origSize, size_t* compressedSize) {
    char* ret = calloc(2 * origSize, 1);
    size_t retIdx = 0, inIdx = 0;
    size_t retSize = 0;
    while (inIdx < origSize) {
        size_t count = 1;
        size_t contIdx = inIdx;
        while (contIdx < origSize - 1 && data[inIdx] == data[++contIdx]) {
            count++;
        }
        size_t tmpCount = count;

        // break down counts with 2 or more digits into counts ≤ 9
        while (tmpCount > 9) {
            tmpCount -= 9;
            ret[retIdx++] = data[inIdx];
            ret[retIdx++] = data[inIdx];
            ret[retIdx++] = '9';
            retSize += 3;
        }

        ret[retIdx++] = data[inIdx];
        retSize += 1;
        if (tmpCount > 1) {
            // repeat character (this tells the decompressor that the next digit
            // is in fact the # of consecutive occurrences of this char)
            ret[retIdx++] = data[inIdx];
            // convert single-digit count to dataing
            ret[retIdx++] = '0' + tmpCount;
            retSize += 2;
        }

        inIdx += count;
    }
    *compressedSize = retSize;
    return ret;
}

char* RLEdecompress(char* data, size_t compressedSize, size_t uncompressedSize, size_t extraAllocation) {
    char* ret = calloc(uncompressedSize + extraAllocation, 1);
    size_t retIdx = 0, inIdx = 0;
    while (inIdx < compressedSize) {
        ret[retIdx++] = data[inIdx];
        if (data[inIdx] == data[inIdx + 1]) { // next digit is the # of occurrences
            size_t occ = ((data[inIdx + 2]) - '0');
            for (size_t i = 1; i < occ && retIdx < compressedSize; i++) {
                ret[retIdx++] = data[inIdx];
            }
            inIdx += 2;
        }
        inIdx += 1;
    }
    return ret;
}

They seem to work fine, i.e. diff doesn't produce any output when comparing the original files to the compressed-then-uncompressed versions.

However, every once in a while, the files will differ indicating there is a bug somewhere. I haven't been able to find a pattern in the files that exhibit this, but I'll give you an example of what the difference looks like.

enter image description here

The lower one is the original.

As you can see, the byte 21 is repeated twice in the compressed-then-uncompressed version. I haven't been able to identify the issue. Unfortunately the bug happens with very few files: so far I've only observed it with two pdf files, including the one shown above, but I can't share them because it's copyrighted content, but I'm working on coming up with another file that fails so I can provide you with an example.

I have a feeling there is something "obvious" wrong with the code above and I'm just missing it. Help is greatly appreciated.

EDIT:

Here's a test program I'm using to read the offending file, compressing it, then decompressing it. I'm also saving the compressed one to disk in a middle step to have more debug data.

int main(int argc, char** argv) {
    size_t compsz;
    FILE* fp = fopen(argv[1], "r");
    if (!fp) {
        perror("fp");
        return 1;
    }

    if (fseek(fp, 0L, SEEK_END) == -1) {
        return -1;
    }
    // get file size
    size_t filecontentLen = ftell(fp);
    if (filecontentLen < 0) {
        return -1;
    }
    rewind(fp);

    char* filecontentBuf = calloc(filecontentLen, 1);
    if (!filecontentBuf) {
        fclose(fp);
        errno = ENOMEM;
        return -1;
    }
    // read original
    if (fread(filecontentBuf, sizeof(char), filecontentLen, fp) <= 0) {
        int errnosave = errno;
        if (ferror(fp)) {
            fclose(fp);
            free(filecontentBuf);
            errno = errnosave;
            return -1;
        }
    }
    // write compressed
    char* compressed = RLEcompress(filecontentBuf, filecontentLen, &compsz);
    FILE* fpcompWrite = fopen("compressed", "w+");
    if (fwrite(compressed, compsz, 1, fpcompWrite) == -1) {
        perror("fwrite");
    }
    fclose(fpcompWrite);

    // read compressed
    FILE* fpcompRead = fopen("compressed", "r");
    if (!fpcompRead) {
        perror("fpcompRead");
        return 1;
    }
    char* compBuf = calloc(compsz * 2, 1);
    fread(compBuf, compsz, 1, fpcompRead);
    fclose(fpcompRead);

    // decompress and write file
    char* uncompBuf = RLEdecompress(compBuf, compsz, filecontentLen, 0);
    FILE* funcomp = fopen("uncompressed", "w+");
    fwrite(uncompBuf, filecontentLen, 1, funcomp);
    fclose(funcomp);
}
Samuele B.
  • 481
  • 1
  • 6
  • 29
  • Did you take the few bytes around the error in the file you show and compress/decompress just that part of the file? Would be interesting to find at which point the error appears. – fpiette May 22 '21 at 08:58
  • 1
    Please show how you write the files. Note that you should write the size of the uncompressed data, not the size of the buffer. – Paul Ogilvie May 22 '21 at 08:59
  • @PaulOgilvie I write the uncompressed file on a socket and the receiving client saves it to disk, so it's kinda hard to show that part concisely. I do write using the uncompressed size though, that's for sure – Samuele B. May 22 '21 at 09:07
  • 'write the uncompressed file on a socket and the receiving client saves it to disk' thar's a bridge too far. How can you be sure that the problem is in the code you posted? – Martin James May 22 '21 at 09:11
  • Maybe change to safe your file locally, so it is all under your control. – Paul Ogilvie May 22 '21 at 09:12
  • Also: if you have a file with the problem, when you repeat the decompression, does the error show up again? If yes, when you also repeat the compression and then the decompression, does the error show up? This can help to determine if the error is in the decompressor or the compressor. – Paul Ogilvie May 22 '21 at 09:14
  • If the error shows up consistently, then it is a logic error. Otherwise it is a memory error (out of bounds writing). – Paul Ogilvie May 22 '21 at 09:16
  • @MartinJames I have just setup a test program that reads the offending file, compresses it, then uncompresses it, and saves it to disk, and it's exhibiting the same issue, so I can confirm it is indeed the code I posted above that's failing – Samuele B. May 22 '21 at 09:30
  • @PaulOgilvie can you please elaborate on the "when you repeat the decompression, does the error show up again?" part? If I compress the file again, after a cycle of comp-uncomp, I'm already working with a damaged file to begin with – Samuele B. May 22 '21 at 09:31
  • You compressed the file. You decompress it and it shows an error. Now decompress it again - does it show the error again? – Paul Ogilvie May 22 '21 at 09:36
  • Please publish complete source code for your test program (Edit your question for that purpose). If needed, reduce it to the bare minimum. – fpiette May 22 '21 at 09:37
  • Shouldn't you open in `"rb"` mode??? – Paul Ogilvie May 22 '21 at 09:42
  • @PaulOgilvie nope, linux ignores the `b` thing in `fopen`. All files are read as binary in a compliant system – Samuele B. May 22 '21 at 09:43
  • 1
    At least your code is then not portable. – Paul Ogilvie May 22 '21 at 09:44
  • Formally, you should allocate `compBuf` as the file size, not as `compsz`. – Paul Ogilvie May 22 '21 at 09:46
  • Same with decompressing: you assume `filecontentLen` is correct. These assumptions may well be the source of the error, or may hide the error. – Paul Ogilvie May 22 '21 at 09:48
  • In your test code, maybe decompressing the compressed buffer with passing by file write and read would be useful. – fpiette May 22 '21 at 10:06
  • So, wait... this compresses _binary_ files, but uses _text digits_ from 0 to 9 for the amount of repeats, while it could be using that same single byte for a full 00-FF byte value? Why? – Nyerguds May 26 '21 at 13:02

1 Answers1

2

I think the problem is that

for (size_t i = 1; i < occ && retIdx < compressedSize; i++) {
    ret[retIdx++] = data[inIdx];
}

should be changed in

for (size_t i = 1; i < occ && retIdx < uncompressedSize; i++) {
    ret[retIdx++] = data[inIdx];
}

in the decompression algorithm, since redIdx is bounded by uncompressedSize, and maybe in some rare cases it copies fewer bytes than it should.

gio54321
  • 66
  • 3
  • Good catch! And I find it plainly _wrong_ that the decompressor must know the size of the uncompressed file. The caller cannot know this size. The decompressor should use expanding buffers. – Paul Ogilvie May 22 '21 at 10:03
  • @PaulOgilvie this is true in general, but in the context of my application (which, as I mentioned, is based on communication via socket), the whole architecture is based around the fact that file sizes are known in advance, so I can make that assumption and save myself the hassle of all those reallocs. – Samuele B. May 22 '21 at 10:26