
I am building an LZW encoding algorithm that uses a dictionary and hashing so it can quickly look up working words already stored in the dictionary.

The algorithm gives correct results when run on smaller files (circa a few hundred symbols), but it crashes on larger files, especially on files that contain fewer distinct symbols. The worst case is a file consisting of only one symbol, say 'y': there it just crashes while the dictionary is not even close to full. When a large input file consists of more than one distinct symbol, the dictionary gets close to full, approximately 90%, but then it crashes as well.

Considering the structure of my algorithm, I am not quite sure what is causing it to crash in general, or why it crashes so soon when a large file of just one symbol is given. It must be something about the hashing (this is my first time doing it, so it might have some bugs).

The hash function I am using can be found here, and from what I have tested, it gives good results: oat_hash

The LZW encoding algorithm is based on this link, with a slight change: it runs only until the dictionary is full: LZW encoder

Let's get into code:

Note: oat_hash is changed so that it returns value % CAPACITY, so every index points into DICTIONARY.
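For reference, a one-at-a-time hash with that change applied would look roughly like this (the exact code is behind the link above, so treat this sketch as an assumption):

unsigned int oat_hash(const char *key, size_t len){
    unsigned int h = 0;
    size_t i;
    for(i = 0; i < len; i++){
        h += (unsigned char)key[i];
        h += (h << 10);
        h ^= (h >> 6);
    }
    h += (h << 3);
    h ^= (h >> 11);
    h += (h << 15);
    return h % CAPACITY; // the change described above: result always indexes DICTIONARY
}

With the modulo folded into the hash, every value it returns is a usable index into DICTIONARY.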

// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i has its code in CODES on index i
int position = 0;
int code_counter = 0;

void encode(FILE *input, FILE *output){

    int succ1 = fseek(input, 0, SEEK_SET);
    if(succ1 != 0) printf("Error: file not open!");

    int succ2 = fseek(output, 0, SEEK_SET);
    if(succ2 != 0) printf("Error: file not open!");

    //1. Working word = next symbol from the input
    char *working_word = malloc(2048*sizeof(char));
    char new_symbol = getc(input);
    working_word[0] = new_symbol;
    working_word[1] = '\0';

    //2. WHILE(there are more symbols on the input) DO
    //3. NewSymbol = next symbol from the input
    while((new_symbol = getc(input)) != EOF){

        char *workingWord_and_newSymbol = NULL;
        char newSymbol[2];
        newSymbol[0] = new_symbol;
        newSymbol[1] = '\0';

        workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);

        int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

        //4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
        if(DICTIONARY[index] != NULL){
            //5. WorkingWord += NewSymbol
            working_word = working_word_and_new_symbol(working_word, newSymbol);
        }
        //6. ELSE
        else{
            //7. OUTPUT: code for WorkingWord
            int idx = oat_hash(working_word, strlen(working_word));
            fprintf(output, "%u", CODES[idx]);

            //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
            if(!dictionary_full()){
                DICTIONARY[index] = workingWord_and_newSymbol;
                CODES[index] = code_counter + 1;
                code_counter += 1;
                working_word = strdup(newSymbol);
            }else break;
        }
        //10. END IF
    }
    //11. END WHILE

    //12. OUTPUT: code for WorkingWord
    int index = oat_hash(working_word, strlen(working_word));
    fprintf(output, "%u", CODES[index]);

    free(working_word);
}

Whizzil
    General remark - `sizeof(char)` is always 1. – Frerich Raabe Dec 13 '13 at 12:52
  • Another general remark: `new_symbol` needs to be an `int` to process EOF correctly (see http://c-faq.com/stdio/getcharc.html) –  Dec 13 '13 at 13:25
  • More general remarks: `working_word_and_new_symbol` looks like an important function you haven't shown us. Nobody can test your code until you make it a complete program. Your output stream is a bunch of `%u`'s with no separators, impossible to parse. Hash collisions are apparently not handled at all, and by the time you get to 90% full you've probably had some. –  Dec 13 '13 at 15:17

2 Answers

 int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

And later

    int idx = oat_hash(working_word, strlen(working_word));

    fprintf(output, "%u", CODES[idx]);

    //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
    if(!dictionary_full()){
        DICTIONARY[index] = workingWord_and_newSymbol;
        CODES[index] = code_counter + 1;
        code_counter += 1;
        working_word = strdup(newSymbol);
    }else break;

idx and index are unbounded, and you use them to access a bounded array, so you're accessing memory out of range. Here's a suggestion, but it may skew the distribution; if your hash range is much larger than CAPACITY, it shouldn't be a problem. You also have another problem which was mentioned: collisions. You need to handle them, but that's a different problem (see the sketch after the code below).

int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
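As for the collisions, a minimal sketch of open addressing with linear probing, assuming the dictionary stores the key strings so they can be compared. The helper find_slot is hypothetical, not part of the asker's code, and it relies on the caller checking dictionary_full() so the probe cannot cycle through a full table forever.

// Hypothetical helper: probe linearly from the hashed slot until the key
// is found or an empty slot shows where it could be inserted.
int find_slot(const char *key){
    int i = oat_hash(key, strlen(key)) % CAPACITY;
    while(DICTIONARY[i] != NULL){
        if(strcmp(DICTIONARY[i], key) == 0)
            return i;            // key already in the dictionary
        i = (i + 1) % CAPACITY;  // collision: try the next slot
    }
    return i;                    // first empty slot: key not present yet
}

The encoder can then test DICTIONARY[index] != NULL on the returned slot exactly as before, but two distinct strings no longer silently share one index.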
Martin

LZW compression is certainly used to construct binary files, and it normally needs to be able to read binary files as well.

The following code is problematic as it relies on new_symbol never being a \0.

newSymbol[0] = new_symbol;
newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)

This needs a rewrite to work with arrays of bytes rather than with strings.
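For illustration, one way to carry explicit lengths instead of relying on the terminator; the Word type and word_append helper are invented for this sketch, not taken from the original code:

#include <stdlib.h>

typedef struct {
    unsigned char *bytes; // raw data: may legitimately contain 0x00
    size_t len;           // explicit length replaces strlen()
} Word;

// Append one byte to the buffer; error handling omitted for brevity.
void word_append(Word *w, unsigned char b){
    w->bytes = realloc(w->bytes, w->len + 1);
    w->bytes[w->len++] = b;
}

Hashing then becomes oat_hash((const char *)w->bytes, w->len), and comparisons use memcmp() with the stored length.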


fopen() was not shown. Ensure the input is opened in binary: input = fopen(..., "rb");

@Wumpus Q. Wumbley is correct: use an int for new_symbol, as illustrated below.
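To illustrate: getc() returns an int so that EOF (a negative value) can be distinguished from every valid byte; with a plain char, the loop may stop early on a 0xFF byte or never see EOF at all, depending on whether char is signed.

int new_symbol;                          // int, not char
while((new_symbol = getc(input)) != EOF){
    unsigned char byte = (unsigned char)new_symbol; // safe once EOF is ruled out
    // ... process byte ...
}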


Minor:

new_symbol and newSymbol are confusing.

Consider:

// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or 
char *working_word = malloc(WORKING_WORD_N);
chux - Reinstate Monica
  • Thank you for your suggestions, but they don't really solve my problem. The question is why this code crashes for a large input file with only 1 or 2 distinct symbols, but works for smaller input files with greater symbol diversity. How is a string not an array of bytes? – Whizzil Dec 13 '13 at 14:12
  • "A string is a contiguous sequence of characters terminated by and including the first null character." A string _always_ has one and only one `\0` at the end. "An array of bytes" is any combination of any bytes including ones with multiple `\0`. It amazing you have already coded an "array of bytes" vs. "string" solution so fast to know this did not solve you problem. Consider looking to see if `working_word` over-fills. `working_word = strdup(newSymbol)` likely contributes to a limit problem, it's not 2048 anymore. – chux - Reinstate Monica Dec 13 '13 at 14:36
  • I can't see how this would affect the results, since the input symbols are never '\0', so no byte array will be null-terminated earlier than it should be. My symbols are from the English alphabet, and consequently so are the working words. '\0' is never present, except at the end of a string. – Whizzil Dec 16 '13 at 12:16