I am currently implementing a BK-tree for a spell checker. The dictionary I am working with is very large (millions of words), so I cannot afford inefficiencies. I know that the lookup function I wrote (arguably the most important part of the entire program) can be improved, and I was hoping to get some help with that. Here's the lookup that I wrote:

public int get(String query, int maxDistance)
{
    calculateLevenshteinDistance cld = new calculateLevenshteinDistance();
    int d = cld.calculate(root, query);
    int tempDistance=0;

    if(d==0)
        return 0;

    if(maxDistance==Integer.MAX_VALUE)
        maxDistance=d;

    int i = Math.max(d-maxDistance, 1);
    BKTree temp=null;

    for(;i<=maxDistance+d;i++)
    {
        temp=children.get(i);
        if(temp!=null)
        {
            tempDistance=temp.get(query, maxDistance);
        }
        if(maxDistance<tempDistance)
            maxDistance=tempDistance;
    }

    return maxDistance;
}

I know that I am running the loop an unnecessarily large number of times and that we can trim the search space to make the lookup faster. I'm just not sure how to best do that.

templatetypedef
efficiencyIsBliss
  • @Mitch - That may be true... but people answering only on the pretense of being accepted is starting to get a little old. Shouldn't people be answering to be helpful? – Justin Niessner Oct 05 '10 at 16:38
  • @efficiencyIsBliss - I answer questions because I need my answers accepted. Good luck with this one. – IVlad Oct 05 '10 at 16:51
  • @Justin, I understand where you are coming from. But I think a healthy argument can be made that it is good, from the perspective of the communal knowledge pool that is SO, to encourage citizens to engage in best practices. A question with a checked answer is more useful for the random googler who happens upon SO than one without such an answer. – Kirk Woll Oct 05 '10 at 16:53
  • Do you see that box way up at the top there? The one that says "unanswered". That is why people need to accept answers. They are polluting the list and wasting the time of people trying to help with questions that are actually unanswered. – Andrew Oct 05 '10 at 16:55
  • Looks like this question isn't getting answered. – efficiencyIsBliss Oct 05 '10 at 17:12
  • This doesn't look like a good question, since performance will be dominated by code you don't include, and the search space is also affected by that code. – David Thornley Oct 05 '10 at 18:08
  • Out of curiosity, what language spelling requires *millions of words*? – Déjà vu Oct 06 '10 at 13:01
  • @ring0 All the words I've seen so far are English, so I don't really know why it's so large. Maybe it has a lot of words that aren't really words. I guess it's just to make the problem harder. – efficiencyIsBliss Oct 07 '10 at 01:38
  • @efficiencyisbliss *make problem harder* ? Homework? :-) – Déjà vu Oct 07 '10 at 05:21
  • @ring0 Sorry to disappoint you, but no. – efficiencyIsBliss Oct 07 '10 at 20:09

1 Answer

Your loop looks generally correct, if a little byzantine. Your attempt to refine the stopping condition (with tempDistance/maxDistance) is incorrect, however: if you want to find all the results, the structure of the BK-tree requires that you explore every child whose edge distance from the current node lies between d-k and d+k, where d is the Levenshtein distance from the query to the current node and k is the search threshold, so you can't prune it like that.
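
As a minimal sketch of that d-k to d+k rule, a range query might look like the following. This is not the asker's code: the class name BKTreeSketch, the methods add and search, and the inline levenshtein helper are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical BK-tree node; every name here is illustrative.
class BKTreeSketch {
    private String word; // null while the tree is empty
    // Child subtrees keyed by their Levenshtein distance to this node's word.
    private final Map<Integer, BKTreeSketch> children = new HashMap<>();

    void add(String w) {
        if (word == null) { word = w; return; }
        int d = levenshtein(word, w);
        if (d == 0) return; // word is already stored
        BKTreeSketch child = children.get(d);
        if (child == null) {
            child = new BKTreeSketch();
            children.put(d, child);
        }
        child.add(w);
    }

    // Returns every stored word within distance k of the query.
    List<String> search(String query, int k) {
        List<String> out = new ArrayList<>();
        if (word != null) collect(query, k, out);
        return out;
    }

    private void collect(String query, int k, List<String> out) {
        int d = levenshtein(word, query); // d depends on the current node
        if (d <= k) out.add(word);
        // k stays fixed; only edges in [d - k, d + k] can lead to matches.
        for (int i = Math.max(1, d - k); i <= d + k; i++) {
            BKTreeSketch child = children.get(i);
            if (child != null) child.collect(query, k, out);
        }
    }

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}
```

With the words book, books, cake, boo, cape and cart added in that order, search("bok", 1) returns book and boo. The cake subtree hangs off edge 4 of the root, which falls outside the range 0 to 2, so it is never visited.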

What makes you think you're exploring too much of the tree?

You may find my followup post on Levenshtein Automata instructive, as they're more efficient than BK-trees. If you're building a spelling checker, though, I'd recommend following Favonius' suggestion and checking out this article on how to write one. It's much better suited to spelling correction than a naive string-distance check.

Nick Johnson
  • I was aware of the d-k to d+k part and I implemented it, but it gave me incorrect results, which was why I got rid of it completely. That's why I was so sure that I wasn't trimming the search space efficiently. Could you explain that part a bit more here? Do the d and k remain constant or do they change with every iteration down the tree? – efficiencyIsBliss Oct 07 '10 at 01:46
  • 'k' is the threshold, and remains constant. 'd' is the distance between the search term and the current node, and depends on the node you're evaluating. – Nick Johnson Oct 07 '10 at 11:16
  • To reduce the search space, can we change k to mirror the minimum distance found so far? If we know that the first word we looked at was at a distance of 5 from our word, then there is no point in looking at words that may be at a distance of 6 or higher, right? – efficiencyIsBliss Oct 09 '10 at 21:48
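
The idea in the last comment can be sketched for the case where only the single closest word is wanted. Shrinking the radius to the best distance found so far is safe there, because the triangle inequality guarantees that every word under an edge e is at least |e - d| away from the query. It is not a way to find all words within a fixed threshold. The class BestMatchBKTree and its methods are hypothetical names, not code from the question or the answer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical BK-tree that returns only the closest stored word.
class BestMatchBKTree {
    private String word; // null while the tree is empty
    private final Map<Integer, BestMatchBKTree> children = new HashMap<>();

    // Mutable holder for the best candidate seen so far.
    private static final class Best {
        String word;
        int distance = Integer.MAX_VALUE;
    }

    void add(String w) {
        if (word == null) { word = w; return; }
        int d = levenshtein(word, w);
        if (d == 0) return; // word is already stored
        BestMatchBKTree child = children.get(d);
        if (child == null) {
            child = new BestMatchBKTree();
            children.put(d, child);
        }
        child.add(w);
    }

    // Returns the closest stored word, or null if the tree is empty.
    String nearest(String query) {
        Best best = new Best();
        if (word != null) visit(query, best);
        return best.word;
    }

    private void visit(String query, Best best) {
        int d = levenshtein(word, query);
        if (d < best.distance) { best.distance = d; best.word = word; }
        for (Map.Entry<Integer, BestMatchBKTree> e : children.entrySet()) {
            // Words under this edge are at least |edge - d| from the query,
            // so skip the subtree unless it could strictly beat the current best.
            if (Math.abs(e.getKey() - d) < best.distance) {
                e.getValue().visit(query, best);
            }
        }
    }

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}
```

With the words book, books, cake, boo, cape and cart stored, nearest("cap") returns cape, the only word at distance 1. The radius tightens each time a closer word is found, so later subtrees are pruned more aggressively than they would be with a fixed k.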