I'm taking an algorithms class where we have to implement LZW compression in Java. I decided to use a trie for this. I have implemented the trie and it works, but it is very slow and it barely compresses.
We are supposed to use 8-bit symbols and be able to compress any binary file.
Given a ~4 MB file (bible.txt), I get 549,012 elements in my codes array. When I write these elements to a file (one integer code per line, as text), I end up with a "compressed" file of 3.5 MB, so I save only about 0.5 MB.
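For comparison, here is a rough sketch of writing the codes as fixed-width binary fields instead of text lines. `CodePacker`, `pack`, and `bitsPerCode` are hypothetical names that are not part of my program, and the fixed width is an assumption (real LZW implementations often grow the width as the dictionary grows). Since the dictionary gains about one entry per emitted code, the largest code here should be a little under 550,000, which fits in 20 bits.

```java
import java.util.List;

class CodePacker {
    /**
     * Packs each code into exactly bitsPerCode bits, most significant bit first.
     * Assumes 1 <= bitsPerCode <= 31 and that every code fits in that many bits.
     */
    static byte[] pack(List<Integer> codes, int bitsPerCode) {
        long totalBits = (long) codes.size() * bitsPerCode;
        // Round up to a whole number of bytes; unused trailing bits stay 0.
        byte[] out = new byte[(int) ((totalBits + 7) / 8)];
        long bitPos = 0; // index of the next free bit in the output
        for (int code : codes) {
            for (int bit = bitsPerCode - 1; bit >= 0; bit--) {
                if (((code >>> bit) & 1) != 0) {
                    out[(int) (bitPos >>> 3)] |= (byte) (0x80 >>> (int) (bitPos & 7));
                }
                bitPos++;
            }
        }
        return out;
    }
}
```

Under that assumption, 549,012 codes at 20 bits each come to 1,372,530 bytes, or roughly 1.4 MB, compared with the 3.5 MB text file.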
How can I make this program more efficient? I feel like I misunderstood something fundamental here, or I'm missing something obvious, but I'm out of ideas on why this doesn't compress.
(I got my test file bible.txt from this website: https://corpus.canterbury.ac.nz/descriptions/)
I read bytes from a binary file like this (I read each byte as an int and convert it to a char so that values of 0x80 and above do not become negative):
    public String readFile(String path) throws IOException, FileNotFoundException {
        File file = new File(path);
        StringBuilder string = new StringBuilder();
        try (FileInputStream fileInputStream = new FileInputStream(file)) {
            int singleCharInt;
            char singleChar;
            while ((singleCharInt = fileInputStream.read()) != -1) {
                singleChar = (char) singleCharInt;
                string.append(singleChar);
            }
        }
        return string.toString();
    }
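As a side note on the sign issue, here is a minimal sketch of reading the file into a `byte[]` in one call and masking each byte when it is used. `UnsignedBytes`, `readAll`, and `toUnsigned` are hypothetical names; `Files.readAllBytes` is the standard `java.nio.file` call. I have not measured whether this is noticeably faster than the loop above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class UnsignedBytes {
    /** Reads the whole file in one call instead of one read() per byte. */
    static byte[] readAll(Path path) throws IOException {
        return Files.readAllBytes(path);
    }

    /** Maps a signed Java byte (-128..127) to its unsigned value (0..255). */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }
}
```

With this approach the data stays a `byte[]`, and `toUnsigned(data[i])` gives the 8-bit symbol value without going through `char` or `String`.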
My main method looks like this:
    public static void main(String args[]) throws FileNotFoundException, IOException {
        String bytes = new FileReader().readFile("/home/user/Code/Trie/bible.txt");
        ArrayList<Integer> codes = new Compress().compress(bytes);
    }
My Compress class looks like this:
    import java.util.ArrayList;

    public class Compress {
        private int code = 0;

        public ArrayList<Integer> compress(String data) {
            Trie trie = new Trie();
            // Initialize Trie Data Structure with alphabet (256 possible values with 8-bit
            // symbols)
            for (code = 0; code <= 255; code++) {
                trie.insert(Character.toString((char) code), code);
            }
            // code is already 256 after the loop, so this skips 256: new entries start at 257
            code++;
            String s = Character.toString(data.charAt(0));
            ArrayList<Integer> codes = new ArrayList<Integer>();
            for (int i = 1; i < data.length(); i++) {
                String c = Character.toString(data.charAt(i));
                // find() walks the trie from the root and returns -1 if the path is missing
                if (trie.find(s + c) > 0) {
                    s += c;
                } else {
                    codes.add(trie.find(s));
                    trie.insert(s + c, code);
                    code++;
                    s = c;
                }
            }
            codes.add(trie.find(s));
            return codes;
        }
    }
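One thing I suspect about the speed: the loop above calls `trie.find(s + c)` from the root for every input character, and then `trie.find(s)` and `trie.insert(s + c, code)` walk the same path again. Below is a minimal sketch of the same loop that instead keeps a reference to the current node and takes one step per input byte. `IncrementalLzw` and its `Node` class are hypothetical and separate from my `Trie` and `TrieNode`; the sketch takes a `byte[]` rather than a `String` and starts new codes at 256 rather than 257.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IncrementalLzw {
    /** Hypothetical node: one dictionary entry, children keyed by the next byte value (0..255). */
    private static final class Node {
        final int code;
        final Map<Integer, Node> children = new HashMap<>();

        Node(int code) {
            this.code = code;
        }
    }

    static List<Integer> compress(byte[] data) {
        List<Integer> codes = new ArrayList<>();
        if (data.length == 0) {
            return codes;
        }

        // Seed the dictionary with all 256 single-byte strings.
        Node root = new Node(-1);
        for (int b = 0; b < 256; b++) {
            root.children.put(b, new Node(b));
        }
        int nextCode = 256;

        // current always points at the node for the longest match so far.
        Node current = root.children.get(data[0] & 0xFF);
        for (int i = 1; i < data.length; i++) {
            int symbol = data[i] & 0xFF;
            Node next = current.children.get(symbol);
            if (next != null) {
                current = next; // match extended by one byte, a single map lookup
            } else {
                codes.add(current.code);
                current.children.put(symbol, new Node(nextCode++));
                current = root.children.get(symbol); // restart from the single-byte entry
            }
        }
        codes.add(current.code);
        return codes;
    }
}
```

For example, `IncrementalLzw.compress("TOBEORNOTTOBEORTOBEORNOT".getBytes(StandardCharsets.US_ASCII))` returns 16 codes for the 24 input bytes. Each input byte costs one map lookup here, and no strings are concatenated.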
My Trie class looks like this:
    public class Trie {
        private TrieNode root;

        public Trie() {
            this.root = new TrieNode(false);
        }

        public void insert(String word, int code) {
            TrieNode current = root;
            for (char l : word.toCharArray()) {
                current = current.getChildren().computeIfAbsent(Character.toString(l), c -> new TrieNode(false));
            }
            current.setCode(code);
            current.setWordEnd(true);
        }

        public int find(String word) {
            TrieNode current = root;
            for (int i = 0; i < word.length(); i++) {
                char ch = word.charAt(i);
                TrieNode node = current.getChildren().get(Character.toString(ch));
                if (node == null) {
                    return -1;
                }
                current = node;
            }
            return current.getCode();
        }
    }
My TrieNode class looks like this:
    import java.util.HashMap;

    public class TrieNode {
        private HashMap<String, TrieNode> children;
        private int code;
        private boolean wordEnd;

        public TrieNode(boolean wordEnd) {
            this.children = new HashMap<String, TrieNode>();
            this.wordEnd = wordEnd;
        }

        public HashMap<String, TrieNode> getChildren() {
            return this.children;
        }

        public void setWordEnd(boolean wordEnd) {
            this.wordEnd = wordEnd;
        }

        public boolean isWordEnd() {
            return this.wordEnd;
        }

        public int getCode() {
            return this.code;
        }

        public void setCode(int code) {
            this.code = code;
        }
    }
Thank you for your time!