My LZW compression program barely compresses

Question

I'm taking an algorithms class where we have to implement LZW compression in Java. I decided to use a Trie data structure for this, and I've already implemented the Trie and got it working, but is very slow, and it barely compresses.

We are supposed to use 8-bit symbols and be able to compress any binary file.

Given a ~4MB file (bible.txt), I get around 549,012 elements in my codes array. When I write these elements to a file (one integer code per line), I end up with a "compressed" file of 3.5MB, so I get .5MB of compression.

How can I make this program more efficient? I feel like I misunderstood something fundamental here, or I'm missing something obvious, but I'm out of ideas on why this doesn't compress.

(I got my test file bible.txt from this website: https://corpus.canterbury.ac.nz/descriptions/)

I read bytes from a binary file like this (reading as int and converting to char is necessary so that values above 0x80 are not negative):

public String readFile(String path) throws IOException, FileNotFoundException {
    File file = new File(path);

    StringBuilder string = new StringBuilder();

    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        int singleCharInt;
        char singleChar;
        while((singleCharInt = fileInputStream.read()) != -1) {
            singleChar = (char) singleCharInt;
            string.append(singleChar);
        }
    } 

    return string.toString();
}

My main method looks like this:

    public static void main(String args[]) throws FileNotFoundException, IOException {
        String bytes = new FileReader().readFile("/home/user/Code/Trie/bible.txt");

        ArrayList<Integer> codes = new Compress().compress(bytes);
    }

My Compress class looks like this:

public class Compress {

    private int code = 0;

    public ArrayList<Integer> compress(String data) {
        Trie trie = new Trie();

        // Initialize Trie Data Structure with alphabet (256 possible values with 8-bit
        // symbols)
        for (code = 0; code <= 255; code++) {
            trie.insert(Character.toString((char) code), code);
        }

        code++;

        String s = Character.toString(data.charAt(0));

        ArrayList<Integer> codes = new ArrayList<Integer>();

        for (int i = 1; i < data.length(); i++) {
            String c = Character.toString(data.charAt(i));

            if (trie.find(s + c) > 0) {
                s += c;
            } else {
                codes.add(trie.find(s));
                trie.insert(s + c, code);
                code++;
                s = c;
            }
        }

        codes.add(trie.find(s));

        return codes;
    }

}

My Trie class looks like this:

public class Trie {
    private TrieNode root;

    public Trie() {
        this.root = new TrieNode(false);
    }

    public void insert (String word, int code) {
        TrieNode current = root;

        for (char l: word.toCharArray()) {
            current = current.getChildren().computeIfAbsent(Character.toString(l), c -> new TrieNode(false));
        }
        current.setCode(code);
        current.setWordEnd(true);
    }

    public int find(String word) {
        TrieNode current = root;

        for (int i = 0 ; i < word.length(); i++) {
            char ch = word.charAt(i);

            TrieNode node = current.getChildren().get(Character.toString(ch));

            if (node == null) {
                return -1;
            }

            current = node;
        }

        return current.getCode();
    }
}

My TrieNode class looks like this:

public class TrieNode {
    private HashMap<String, TrieNode> children;
    private int code;
    private boolean wordEnd;

    public TrieNode(boolean wordEnd) {
        this.children = new HashMap<String, TrieNode>();
        this.wordEnd = wordEnd;
    }

    public HashMap<String, TrieNode> getChildren() {
        return this.children;
    }

    public void setWordEnd(boolean wordEnd) {
        this.wordEnd = wordEnd;
    }
    
    public boolean isWordEnd() {
        return this.wordEnd;
    }

    public int getCode() {
        return this.code;
    }

    public void setCode(int code) {
        this.code = code;
    }
}

Thank you for your time!

You need to start by understanding the differences between `int`, `char` and `byte`, how they represent data, and how Java converts between them. You get no compression because you are using 4 bytes (an `int`) to store every byte, i.e. taking 4x the space. Also, though I haven't examined your code in detail, you may also be introducing corruption with int-to-char conversion. Those fundamentals are enough for an entire chapter in a tutorial, and much too broad for Stack Overflow. — Jim Garrison, Jan 31 '21 at 19:04
@JimGarrison I don't understand what I'm supposed to use if int's are no good for storing codes. I get around 250,000 codes for a large file, which is bigger than byte and short. The next option is int, but you're saying those are too big, so I'm not sure what the alternative is. You are definitely right on my fundamentals lacking, though. — vimdiesel, Jan 31 '21 at 19:24
Internally you will end up using `int`, but the input values you are working with are all 1 byte long (0x00-0xFF), and since you anticipate 250k codes you will be using variable-length encoding. To actually write out the compressed data to a file you are going to be converting those codes to variable length bit strings, packing them in a buffer, and writing out the resulting bytes. NIO buffers will be useful here. There should be no `char` or `String` anywhere in your code because the input data must be treated as binary, without character set conversion. — Jim Garrison, Jan 31 '21 at 19:36

score 1 · Answer 1 · answered Jan 31 '21 at 19:45

What does this mean: "When I write these elements to a file (one integer code per line)"? You're writing four bytes to the file for each code? You're writing four bytes and new line? You're writing a number in decimal and a new line?

In any case, all of those are wrong. You need to store the codes as bits. In the usual LZW implementation, the number of bits in a code starts at 9, and then increments as more codes are created. Further into a file, a code might be, for example, 12 or 13 bits. The decoder knows from the data when the encoder incremented, so it always knows how many bits to get for the next code. Every once in a while it makes sense to reset back to 9 bits, which is signaled by the encoder to the decoder.

So how do you read and write bits to a file? You will quickly find that there are no functions for that. You need to write them yourself.

In short, you keep a buffer of bits in an integer, using the bit shift and or operations to add bits to the buffer, keeping track of how many bits are in the buffer in another integer. For encoding, after you add bits to the buffer, you see if there are at least 8 bits in there. If so, write out a byte to the file, and remove the 8 bits from the buffer. Repeat until there are less than 8 bits in the buffer.

Some care must be taken at the end to write out the last few bits into one more byte, making sure you have thought through how the decoder will know when to stop decoding bits.

Same thing on the decoder side, reading bytes from the input file and adding 8 bits at a time to your buffer until you have enough bits to pull for the next code.

My LZW compression program barely compresses

1 Answers1