
I searched the web for an implementation of a Levenshtein trie and found this: Levenshtein Distance Challenge: Causes. I tried to add a piece of code that normalizes the distance by word length. For a short word of five letters ('Apple'), if I get 'Aple', the distance is 1 and I accept it as the same word. For a much longer word ('circumstances'), more mistakes should be allowed. With two mistakes in this word, the original code calculates a minimum distance of 2 and does not accept it. So I want to divide the distance by the logarithm of the word length. The normalized distance between 'circumstances' and 'kirkumstances' would then be smaller than 2, and because of the cast to int it would become 1. That is what I want to do.
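To make the arithmetic concrete, here is a minimal sketch of the normalization described above. The class and method names (`NormalizedDistance`, `normalize`) are made up for illustration and are not part of the original code; only the expression `(int)(distance / Math.log10(length))` is taken from the question.

```java
// Sketch only: applies the question's log10 scaling to an already computed raw edit distance.
final class NormalizedDistance {

    // Hypothetical helper: divides the raw distance by log10(wordLength) and truncates toward zero.
    static int normalize(int rawDistance, int wordLength) {
        return (int) (rawDistance / Math.log10(wordLength));
    }

    public static void main(String[] args) {
        // 'circumstances' has 13 letters: 2 / log10(13) is about 1.79, truncated to 1.
        System.out.println(normalize(2, 13)); // prints 1
        // 'Aple' has 4 letters: 1 / log10(4) is about 1.66, truncated to 1.
        System.out.println(normalize(1, 4));  // prints 1
        // For lengths below 10, log10 is smaller than 1, so the distance is inflated:
        // 2 / log10(3) is about 4.19, truncated to 4.
        System.out.println(normalize(2, 3));  // prints 4
    }
}
```

Note that a word of length 1 gives `Math.log10(1) == 0`, so the division yields infinity (or NaN for a raw distance of 0); a real implementation would probably need to special-case very short words.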

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LevenshteinTrie {
    private int distance = -1;
    private Trie trie = null;

    public LevenshteinTrie(int distance, Set<String> words) {
        this.distance = distance;
        this.trie = new Trie();

        for(String word : words) {
            this.trie.insert(word);
        }
    }

    public Set<String> discoverFriends(String word, boolean normalized) {
        Set<String> results = new HashSet<String>();

        int[] currentRow = new int[word.length() + 1];

        List<Character> chars = new ArrayList<Character>(word.length());

        for(int i = 0; i < word.length(); i++) {
            chars.add(word.charAt(i));
            currentRow[i] = i;
        }

        currentRow[word.length()] = word.length();

        for(Character c : this.trie.getRoot().getChildren().keySet()) {
            this.traverseTrie(this.trie.getRoot().getChildren().get(c), c, chars, currentRow, results, normalized);
        }

        return results;
    }

    private void traverseTrie(TrieNode node, char letter, List<Character> word, int[] previousRow, Set<String> results, boolean normalized) {
        int size = previousRow.length;
        int[] currentRow = new int[size];

        currentRow[0] = previousRow[0] + 1;

        int minimumElement = currentRow[0];

        int insertCost = 0;
        int deleteCost = 0; 
        int replaceCost = 0;

        for(int i = 1; i < size; i++) {
            insertCost = currentRow[i - 1] + 1;
            deleteCost = previousRow[i] + 1;

            if(word.get(i - 1) == letter) {
                replaceCost = previousRow[i - 1];
            } else {
                replaceCost = previousRow[i - 1] + 1;
            }

            currentRow[i] = Math.min(Math.min(insertCost, deleteCost), replaceCost);

            if(currentRow[i] < minimumElement) {
                if(normalized) {
                    minimumElement = (int)(currentRow[i] / (Math.log10(word.size())));
                } else {
                    minimumElement = currentRow[i];
                }
            }
        }

        int tempCurrentRow = currentRow[size - 1];

        if(normalized) {
            tempCurrentRow = (int)(currentRow[size - 1] / (Math.log10(word.size())));
        }

        System.out.println(tempCurrentRow);

        if(tempCurrentRow <= this.distance && node.getWord() != null) {
            results.add(node.getWord());
        }

        if(minimumElement <= this.distance) {
            for(Character c : node.getChildren().keySet()) {
                this.traverseTrie(node.getChildren().get(c), c, word, currentRow, results, normalized);
            }
        }
    }
}

public class Trie {
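    private TrieNode root = null;

    public Trie() {
        this.root = new TrieNode();
    }

    // Accessor used by LevenshteinTrie to start the traversal.
    public TrieNode getRoot() {
        return this.root;
    }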

    public void insert(String word) {
        TrieNode current = this.root;

        if (word.length() == 0) {
            current.setWord(word);
        }

        for (int i = 0; i < word.length(); i++) {
            char letter = word.charAt(i);

            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.getChildren().put(letter, new TrieNode());
                current = current.getChild(letter);
            }

            if (i == word.length() - 1) {
                current.setWord(word);
            }
        }
    }
}

import java.util.HashMap;
import java.util.Map;

public class TrieNode {
    public static final int ALPHABET = 26;
    private String word = null;
    private Map<Character, TrieNode> children = null;

    public TrieNode() {
        this.word = null;
        this.children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {
        if(this.children != null) {
            if(children.containsKey(letter)) {
                return children.get(letter);
            }
        }

        return null;
    }

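    public String getWord() {
        return word;
    }

    // Called by Trie.insert to mark the node that ends a word.
    public void setWord(String word) {
        this.word = word;
    }

    // Called by Trie.insert and by the traversal in LevenshteinTrie.
    public Map<Character, TrieNode> getChildren() {
        return children;
    }
}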

Unfortunately, this code does not work correctly. I set the maximum distance to 1. When I run the program and search for 'vdimir putin' (I have 'vladimir putin' in my trie), the program does not accept it as a friend. When I print the temporarily calculated distances, they look like this:

The tempCurrentRows when maximum distance = 1:

11
11
10
10
10
10
11
11
11
11
10
11
11
11
11
11
11
11
10
10
10
10
10
10
10
10
10
10
9
11
11
10
10
10
10

But when I set the maximum distance to 2, the temporary distances change:

The tempCurrentRows when maximum distance = 2:

11
11
11
10
10
10
10
9
9
8
7
6
5
4
3
2
1
11
11
10
10
9
9

So there must be a serious mistake in the code, but I don't see where or why, or how I have to change the code to make it work the way I want.

Mulgard

2 Answers


How did you implement the search for 'vdimir putin'? Your code seems correct. I tested it with:

public static void main(String[] args) {
    HashSet<String> words = new HashSet<String>();
    words.add("vdimir putin");
    LevenshteinTrie lt = new LevenshteinTrie(2, words);
    Set<String> friends = lt.discoverFriends("vladimir putin", false);
    System.out.println(friends.iterator().next());
}

This prints 'vdimir putin', which means that 'vladimir putin' has a friend within a Levenshtein distance of 2.
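As a cross-check that does not involve the trie, the raw distance between the two strings can be computed with a plain two-row Levenshtein routine. This is a generic sketch, not code from the question; the class name `PlainLevenshtein` is invented for the example.

```java
// Sketch only: textbook two-row Levenshtein distance, independent of the trie code.
final class PlainLevenshtein {

    static int distance(String a, String b) {
        // previous[j] holds the distance between the processed prefix of a and b[0..j).
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int insertCost = current[j - 1] + 1;
                int deleteCost = previous[j] + 1;
                int replaceCost = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(Math.min(insertCost, deleteCost), replaceCost);
            }
            previous = current;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        // 'vladimir putin' minus the letters 'l' and 'a' gives 'vdimir putin'.
        System.out.println(distance("vdimir putin", "vladimir putin")); // prints 2
    }
}
```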

Fortega
  • I think this is more of a comment? – christopher Jun 13 '14 at 13:50
  • This is because you set the maximum distance to 2. If you add `System.out.println(tempCurrentRow);` before `if(tempCurrentRow <= this.distance && node.getWord() != null)`, you can see that there is a big difference between a maximum distance of 1 and a maximum distance of 2. Because of the logarithm I implemented, a distance of 2 should be accepted with a maximum distance of 1, because the logarithm of 2 is smaller than 2 and Java truncates it to 1, exactly as I want it to be. – Mulgard Jun 13 '14 at 14:08
  • Sorry, I didn't mean that the logarithm of 2 is smaller than 2. I meant that 2 / Math.log10(wordlength) is smaller than 2. – Mulgard Jun 13 '14 at 14:22
  • @christopher Maybe it is, but big blocks of code in comments are unreadable. – Fortega Jun 13 '14 at 14:29
  • @Mulgard I don't understand what you are saying here. Please add some test code (main method which prints something?) to the question in which we can see the distance of vdimir putin is 11, and not 2 – Fortega Jun 13 '14 at 14:30
  • I said it in my comment: add System.out.println(tempCurrentRow); before if(tempCurrentRow <= this.distance && node.getWord() != null) and run your test again. – Mulgard Jun 13 '14 at 14:38
  • And don't use 2 as the maximum. Use 1. – Mulgard Jun 13 '14 at 14:40
  • I edited my post. Maybe it's easier to understand now. – Mulgard Jun 13 '14 at 15:48

Oh, I guess I have to normalize the minimum element too:

if(normalized) {
    tempCurrentRow = (int)(currentRow[size - 1] / (Math.log10(word.size())));
    minimumElement = (int)(minimumElement / (Math.log10(word.size())));
}

And replace this:

 if(normalized) {
     minimumElement = (int)(currentRow[i] / (Math.log10(word.size())));
 } else {
     minimumElement = currentRow[i];
 }

with this:

minimumElement = currentRow[i];

With this small change it works fine, the way I want it to. When I now search for 'vdimir putin' with a maximum distance of 1, it correctly finds 'vladimir putin'.
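Put together, the change amounts to tracking the raw row minimum inside the loop and normalizing it only once, right before the pruning check. The following is a compact, self-contained sketch of that idea. It is a simplified rewrite rather than the original classes: `NormalizedTrieSearch`, `Node`, `search` and `walk` are names chosen for this example, and normalization is always on.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: trie search with the log10-normalized distance applied after the row is complete.
final class NormalizedTrieSearch {

    // Minimal trie node: children keyed by character, full word stored at terminal nodes.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;
    }

    private final Node root = new Node();
    private final int maxDistance;

    NormalizedTrieSearch(int maxDistance, Set<String> words) {
        this.maxDistance = maxDistance;
        for (String w : words) {
            Node current = root;
            for (char c : w.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
            current.word = w;
        }
    }

    // Same scaling as in the question. Assumes queryLength > 1, since log10(1) is 0.
    private static int normalize(int raw, int queryLength) {
        return (int) (raw / Math.log10(queryLength));
    }

    Set<String> search(String query) {
        Set<String> results = new HashSet<>();
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i <= query.length(); i++) {
            firstRow[i] = i;
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, results);
        }
        return results;
    }

    private void walk(Node node, char letter, String query, int[] previousRow, Set<String> results) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rawMinimum = currentRow[0]; // kept in raw units for the whole loop

        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (query.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(Math.min(insertCost, deleteCost), replaceCost);
            rawMinimum = Math.min(rawMinimum, currentRow[i]);
        }

        // Accept the word if its normalized distance is within the limit.
        if (node.word != null && normalize(currentRow[size - 1], query.length()) <= maxDistance) {
            results.add(node.word);
        }
        // Prune on the normalized minimum, so raw and normalized values are never mixed.
        if (normalize(rawMinimum, query.length()) <= maxDistance) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, currentRow, results);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>();
        words.add("vladimir putin");
        NormalizedTrieSearch trie = new NormalizedTrieSearch(1, words);
        System.out.println(trie.search("vdimir putin")); // prints [vladimir putin]
    }
}
```

The original version compared later raw cell values against an already normalized `minimumElement`, or never normalized it at all when `currentRow[0]` was the smallest cell, so a branch with a raw minimum of 2 was cut off at a maximum distance of 1 even though its normalized value would have been 1.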

Mulgard