LZW compression doesn't seem to work correctly

Question

I am trying to get this code to work properly, but encoding doesn't behave as it should. I have a text file that's 60 bytes. I encode it and the output file is 100 bytes. When I decode that file, it comes out at about 65 bytes. It decodes properly, but the file is larger than the original. I tried encoding a JPG and the file size did go down, but I couldn't open the file afterwards. I tried to decode the JPG file and it didn't work; cmd seemed to have frozen. This is the code I was trying to use.

import java.util.*;
import java.io.*;

public class LZW {

// Dictionary 
public static short DSIZE = 256;
public static int DSIZEINT = 256;

/** Compress a string to a list of output symbols. */
public static List<Short> compress(String uncompressed) {
    // Build the dictionary.
    short dictSize = DSIZE;
    Map<String,Short> dictionary = new HashMap<String,Short>();
    for (short i = 0; i < DSIZE; i++)
        dictionary.put("" + (char)i, i);

    String w = "";
    List<Short> result = new ArrayList<Short>();
    for (char c : uncompressed.toCharArray()) {
        String wc = w + c;
        if (dictionary.containsKey(wc))
            w = wc;
        else {
            result.add(dictionary.get(w));
            // Add wc to the dictionary.
            dictionary.put(wc, dictSize++);
            w = "" + c;
        }
    }

    // Output the code for w.
    if (!w.equals(""))
        result.add(dictionary.get(w));
    return result;
}

 /** Compress a string to a list of output symbols, supporting larger filesizes. */
public static List<Integer> compressInt(String uncompressed) {
    // Build the dictionary.
    int dictSize = DSIZEINT;
    Map<String,Integer> dictionary = new HashMap<String,Integer>();
    for (int i = 0; i < DSIZEINT; i++)
        dictionary.put("" + (char)i, i);

    String w = "";
    List<Integer> result = new ArrayList<Integer>();
    for (char c : uncompressed.toCharArray()) {
        String wc = w + c;
        if (dictionary.containsKey(wc))
            w = wc;
        else {
            result.add(dictionary.get(w));
            // Add wc to the dictionary.
            dictionary.put(wc, dictSize++);
            w = "" + c;
        }
    }

    // Output the code for w.
    if (!w.equals(""))
        result.add(dictionary.get(w));
    return result;
}

/** Decompress a list of output ks to a string. */
public static String decompress(List<Short> compressed) {
    // Build the dictionary.
    short dictSize = DSIZE;
    Map<Short,String> dictionary = new HashMap<Short,String>();
    for (short i = 0; i < DSIZE; i++)
        dictionary.put(i, "" + (char)i);

    String w = "" + (char)(short)compressed.remove(0);
    String result = w;
    for (short k : compressed) {
        String entry;
        if (dictionary.containsKey(k))
            entry = dictionary.get(k);
        else if (k == dictSize)
            entry = w + w.charAt(0);
        else
            throw new IllegalArgumentException("Bad compressed k: " + k);

        result += entry;

        // Add w+entry[0] to the dictionary.
        dictionary.put(dictSize++, w + entry.charAt(0));

        w = entry;
    }
    return result;
}

/** Decompress a list of output ks to a string, supporting larger filesizes. */
public static String decompressInt(List<Integer> compressed) {
    // Build the dictionary.
    int dictSize = DSIZEINT;
    Map<Integer,String> dictionary = new HashMap<Integer,String>();
    for (int i = 0; i < DSIZEINT; i++)
        dictionary.put(i, "" + (char)i);

    String w = "" + (char)(int)compressed.remove(0);
    String result = w;
    for (int k : compressed) {
        String entry;
        if (dictionary.containsKey(k))
            entry = dictionary.get(k);
        else if (k == dictSize)
            entry = w + w.charAt(0);
        else
            throw new IllegalArgumentException("Bad compressed k: " + k);

        result += entry;

        // Add w+entry[0] to the dictionary.
        dictionary.put(dictSize++, w + entry.charAt(0));

        w = entry;
    }
    return result;
}

public static void main(String[] args) {

    String example = "";
    String s = "";
    int command = 0;

    //Check for correct argument
    if(args.length != 1) {
        System.out.println("Please enter 1 argument.\nArg1: Command ('encode', 'decode', 'encodeInt', 'decodeInt')\nAnd ensure that you are feeding in an input file and output file using '<' and '>'");
        System.exit(1);
    }
    if(args[0].equals("encode")){
        command = 1;
    }
    else if(args[0].equals("decode")){
        command = 2;
    }
    else if(args[0].equals("encodeInt")){
        command = 3;
    }
    else if(args[0].equals("decodeInt")){
        command = 4;
    }
    else {
        System.out.println("Please use either 'encode', 'decode', 'encodeInt', 'decodeInt' as the argument.");
        System.exit(1);
    }

    long start;
    long elapsedTime;

    //Compress
    if(command == 1){

        //Read input file
        s = BinaryStdIn.readString();

        //The actual compression
        start = System.nanoTime();
        List<Short> compressed = compress(s);
        elapsedTime = System.nanoTime() - start;

        //System.err.println(compressed);

        //first writes the number of ints to write
        BinaryStdOut.write(compressed.size());
        //writes compression (to file)
        Iterator<Short> compressIterator = compressed.iterator();
        while (compressIterator.hasNext()){
            BinaryStdOut.write(compressIterator.next());
        }

        System.err.println("LZW Encode time: " + elapsedTime + " ns");

    }
    //Decompress
    else if(command == 2){

        //Build Integer List with input
        List<Short> compressed = new ArrayList<Short>();
        int size = BinaryStdIn.readInt();
        while(size > 0){
            try{
                compressed.add(BinaryStdIn.readShort());
            }
            catch(RuntimeException e){
                System.err.print("*");
            }
            size--;
        }

        //System.err.println(compressed);

        //The actual decompression
        start = System.nanoTime();
        String decompressed = decompress(compressed);
        elapsedTime = System.nanoTime() - start;

        //Print out decompressed data (to file)
        System.out.println(decompressed);

        System.err.println("LZW Decode time: " + elapsedTime + " ns");

    }
    //Compress using Integer size
    else if(command == 3){

        //Read input file
        s = BinaryStdIn.readString();

        //The actual compression
        start = System.nanoTime();
        List<Integer> compressed = compressInt(s);
        elapsedTime = System.nanoTime() - start;

        //System.err.println(compressed);

        //first writes the number of ints to write
        BinaryStdOut.write(compressed.size());
        //writes compression (to file)
        Iterator<Integer> compressIterator = compressed.iterator();
        while (compressIterator.hasNext()){
            BinaryStdOut.write(compressIterator.next());
        }

        System.err.println("LZW Encode time: " + elapsedTime + " ns");

    }
    //Decompress using Integer size
    else if(command == 4){

        //Build Integer List with input
        List<Integer> compressed = new ArrayList<Integer>();
        int size = BinaryStdIn.readInt();
        while(size > 0){
            try{
                compressed.add(BinaryStdIn.readInt());
            }
            catch(RuntimeException e){
                System.err.print("*");
            }
            size--;
        }

        //System.err.println(compressed);

        //The actual decompression
        start = System.nanoTime();
        String decompressed = decompressInt(compressed);
        elapsedTime = System.nanoTime() - start;

        //Print out decompressed data (to file)
        System.out.println(decompressed);

        System.err.println("LZW Decode time: " + elapsedTime + " ns");

    }

    BinaryStdOut.close();


}
}

Appreciate any help. Thanks.

I'm not going to debug your app for you, but having written compression routines before, I'll give you a good way to go about testing it: Start with very small files. 1 character, 2 characters, 3 characters, 4 characters, etc. Try variations of repeated letters and letter sequences. Make each test a little more complex than the last. Compress each one, decompress and see if it matches. If it matches, go on to the next. If it doesn't, figure out what's wrong. It's much easier to test with a small test file than with a jpeg. — Pete, May 01 '14 at 14:55
@Pete Thanks for the advice. It makes more sense to do that than to start with a JPG. However, the text file seems to encode and decode fine; my problem is that the file size increases rather than decreases. — Ayohaych, May 01 '14 at 14:58
The best way to find the problem is incrementally. Keep making the files bigger until you run into the problem. Try different variations. Maybe it's a certain sequence of values that's causing the issue. You could even write an app that creates the text files of ever increasing size (and complexity) and have it automatically compress and decompress the data, check for a match and tell you exactly when you run into the issue. It may happen when you reach a particular size. For example, the cutoff might be 255 bytes or 256 bytes, in which case you may have a bounds error or an off by one error. — Pete, May 01 '14 at 15:02
Saying "the text file seems to encode fine" and "the file size increases" is contradictory. If the file is larger, it's because it has characters in it that weren't in the original. — Mark Ransom, May 01 '14 at 15:08
@MarkRansom Yeah but if I am trying to compress it, the file size should be smaller. So when I encode a file, and decode the encoded file, it gives me the original okay. I'm just not sure as to why the file size goes up rather than down with the encoding — Ayohaych, May 01 '14 at 15:13
Not necessarily. Some things will actually not get smaller if you compress them. Try zipping a 1 byte file and it will be larger than 1 byte. This is due to metadata required to decompress. — Pete, May 01 '14 at 15:38
@Pete Ah, I see. Another issue: when I try to decode a file that is a few KB, it works fine, but when I tried it with an 8 MB file, cmd seemed to have frozen. Nothing happened for at least an hour. Does it just take very long for a file that size, or is something else wrong? — Ayohaych, May 01 '14 at 16:58
I'm not sure what your environment is. I work with Visual Studio and C# and I can perform a "break all" which stops execution arbitrarily and then shows you where things are in terms of execution (current executing instructions of each thread and call stacks and so forth). If your environment has this you could try that. Another possibility is to add logging messages to your code that writes out information at various points of the execution. You might take a look at something like log4j — Pete, May 01 '14 at 19:56
@Pete I'm just using Notepad++ and CMD because the lecture said to not rely on an IDE for the assignment, so if it doesn't work in CMD we won't get marked. I think I'm getting somewhere with it though — Ayohaych, May 01 '14 at 20:49
Good luck with it. Here's one more testing strategy. Try to produce the smallest test file you can produce where the decompressed file is different from the pre-compressed file. Look at the decompressed file and see where the corruption begins (say at the 433rd byte). Then that's the point at which you want to start debugging it, so add something like: `if (result.length >= 430) {...}` and add a dummy statement to put a breakpoint on. So it basically gives you a point at which you can start debugging and step through to see where things go sour. — Pete, May 01 '14 at 21:38
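
Pete's incremental strategy could be automated along these lines. This is only a sketch: `RoundTripCheck`, `firstFailure`, and the stand-in codec with its 255-character cutoff are hypothetical, not part of the posted code. The harness grows the input one character at a time and reports the first length at which the round trip no longer reproduces the input.

```java
import java.util.function.UnaryOperator;

public class RoundTripCheck {

    /**
     * Feeds ever longer inputs through roundTrip (compress, then decompress)
     * and returns the first length whose result differs, or -1 if none does.
     */
    public static int firstFailure(UnaryOperator<String> roundTrip, int maxLength) {
        StringBuilder input = new StringBuilder();
        for (int len = 1; len <= maxLength; len++) {
            // A seven-letter alphabet keeps the data repetitive.
            input.append((char) ('a' + len % 7));
            String original = input.toString();
            if (!original.equals(roundTrip.apply(original))) {
                return len;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical broken codec: silently drops everything past 255 characters.
        UnaryOperator<String> broken = s -> s.length() > 255 ? s.substring(0, 255) : s;
        System.out.println("first failing length: " + firstFailure(broken, 1000));
    }
}
```

With the stand-in codec the harness reports 256, the kind of boundary Pete mentions. To check the code from the question, the lambda could be replaced with something like `s -> LZW.decompress(LZW.compress(s))`.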

Mark Ransom · Accepted Answer · 2014-05-01T15:44:15.990

Even the best compression algorithm will occasionally create an output that's larger than the input. In fact it makes a good test case to find such input. LZW compresses by finding repeated sequences, so an input without any repeating sequences will by necessity get larger.

I once had to create a test input like this. I think it went something like "ABCD...ACBDEG...".
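
To make the expansion concrete, here is a small sketch that counts the codes LZW emits for two 8-character inputs. The class name `LzwCodeCount` and the size formula (a 4-byte count followed by 16 bits per code, which is what the question's `encode` branch appears to write) are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LzwCodeCount {

    /** Returns how many codes LZW emits for the input (256 single-char seeds). */
    public static int countCodes(String input) {
        Map<String, Integer> dictionary = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dictionary.put(String.valueOf((char) i), i);
        }
        int nextCode = 256;
        int emitted = 0;
        String w = "";
        for (char c : input.toCharArray()) {
            String wc = w + c;
            if (dictionary.containsKey(wc)) {
                w = wc;                          // keep extending the current match
            } else {
                emitted++;                       // one code goes out for w
                dictionary.put(wc, nextCode++);
                w = String.valueOf(c);
            }
        }
        return w.isEmpty() ? emitted : emitted + 1; // flush the final match
    }

    /** Assumed on-disk size: 4-byte count header plus 2 bytes per code. */
    public static int outputBytes(String input) {
        return 4 + 2 * countCodes(input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ABCDEFGH", "ABABABAB"}) {
            System.out.println(s + ": " + countCodes(s) + " codes, "
                    + outputBytes(s) + " bytes out for " + s.length() + " bytes in");
        }
    }
}
```

Under those assumptions neither input shrinks: the input with no repeats yields 8 codes (20 bytes) and the repetitive one yields 5 codes (14 bytes). The same arithmetic would be consistent with the 60-byte file in the question growing to 100 bytes, if it produced 48 codes.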

Edit: now that I look more closely at the code, I see that you're writing a list of Shorts to the output. That's almost certainly wrong; one of the necessary steps is to pack each output token into the smallest number of bits, and you're missing that step entirely.
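
As a rough illustration of the packing step, the sketch below writes every code with a fixed number of bits instead of a full 16-bit short. `CodePacker`, `pack`, and `unpack` are hypothetical helpers, and the fixed width is a simplification: real implementations commonly start with narrow codes and widen them as the dictionary grows.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CodePacker {

    /** Packs each code into exactly `width` bits, most significant bit first.
     *  Every code is assumed to fit in `width` bits. */
    public static byte[] pack(List<Integer> codes, int width) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;     // bits collected for the next output byte
        int bitCount = 0;   // how many bits of buffer are in use
        for (int code : codes) {
            for (int bit = width - 1; bit >= 0; bit--) {
                buffer = (buffer << 1) | ((code >> bit) & 1);
                if (++bitCount == 8) {
                    out.write(buffer);
                    buffer = 0;
                    bitCount = 0;
                }
            }
        }
        if (bitCount > 0) {
            out.write(buffer << (8 - bitCount)); // pad the last byte with zeros
        }
        return out.toByteArray();
    }

    /** Reads `count` codes of `width` bits each back out of the packed bytes. */
    public static List<Integer> unpack(byte[] data, int width, int count) {
        List<Integer> codes = new ArrayList<>();
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int code = 0;
            for (int b = 0; b < width; b++, bitPos++) {
                int bit = (data[bitPos / 8] >> (7 - bitPos % 8)) & 1;
                code = (code << 1) | bit;
            }
            codes.add(code);
        }
        return codes;
    }

    public static void main(String[] args) {
        List<Integer> codes = Arrays.asList(65, 66, 256, 258, 66);
        byte[] packed = pack(codes, 9);
        System.out.println(packed.length + " bytes, round trip ok: "
                + unpack(packed, 9, codes.size()).equals(codes));
    }
}
```

Five codes at 9 bits each occupy 45 bits, so they fit in 6 bytes rather than the 10 bytes that five shorts would take. The decoder has to know the code count and width, so those would need to be stored or agreed on separately.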

Judging by your description the code has other problems too, but one's enough for now.
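
One candidate for those other problems, offered as an assumption rather than a confirmed diagnosis, is the `result += entry` line in `decompress`. It copies the whole output on every iteration, so the total work grows quadratically with the output size, which could plausibly look like a hang on a multi-megabyte input. A sketch of the same loop using `StringBuilder` and a list-backed dictionary follows; the class name `LzwDecode` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LzwDecode {

    /** Decodes LZW codes, appending to a StringBuilder instead of a String. */
    public static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        // The list index doubles as the code, so no hashing is needed.
        List<String> dictionary = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            dictionary.add(String.valueOf((char) i));
        }
        // The first code is assumed to be a single-character seed (0..255).
        String w = dictionary.get(codes.get(0));
        StringBuilder result = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            String entry;
            if (k >= 0 && k < dictionary.size()) {
                entry = dictionary.get(k);
            } else if (k == dictionary.size()) {
                entry = w + w.charAt(0);       // code defined by this very step
            } else {
                throw new IllegalArgumentException("Bad compressed k: " + k);
            }
            result.append(entry);              // no copy of the earlier output
            dictionary.add(w + entry.charAt(0));
            w = entry;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompress(Arrays.asList(65, 66, 256, 258, 66)));
    }
}
```

The example in `main` decodes to `ABABABAB`. Iterating by index also avoids `compressed.remove(0)`, which shifts every remaining element of an `ArrayList`.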

edited May 01 '14 at 15:44

answered May 01 '14 at 15:24

Mark Ransom


Yeah, that does make sense. I made a test with lots of repetition and the output file was still larger. Do you have an example input I could try that would certainly give me a smaller file size? Or could it be a different problem? EDIT: I tried it with a short story I found on the web and it reduced the file size by half! If the lecturer tests this code with small files, it won't really work, so is this normal for LZW? – Ayohaych May 01 '14 at 15:31
