
I have words from OCR and need a list of close matches. I can live without the maxFrom check. The sample code is brute force, but hopefully it defines the requirement. Against a list of 600,000 words this takes 2 seconds. FTSword.Word is a string.

Ideally "findd" would only give additional credit to a second d. And once it finds an i then f gets no credit. Brute force I can do that. I am looking to take that 2 seconds down. Will test and report any solution proposed.

The question is: how do I make it faster (and smarter)?

Thanks

            char[] find = new char[] { 'f', 'i', 'n', 'd' };   // search term from OCR
            char[] word;
            int maxFrom = 10;    // skip words at or above this length
            int minMatch = 3;    // minimum matching characters to report
            int count;
            List<FTSword> matchWords = new List<FTSword>();
            foreach (FTSword ftsw in fTSwords)   // ~600,000 words
            {
                if (ftsw.Word.Length < maxFrom)
                {
                    word = ftsw.Word.ToCharArray();
                    count = 0;
                    // Each character of the search term gets credit at most once.
                    foreach (char fc in find)
                    {
                        foreach (char wc in word)
                        {
                            if (char.ToLower(wc) == char.ToLower(fc))
                            {
                                count++;
                                break;
                            }
                        }
                    }
                    if (count >= minMatch)
                    {
                        // Debug.WriteLine(count.ToString() + ftsw.Word);
                        matchWords.Add(ftsw);
                    }
                }
            }
            Debug.WriteLine(matchWords.Count.ToString());
paparazzo
  • You may find approaches here for precalculating data helpful in generating faster search results: http://stackoverflow.com/questions/10096744/puzzle-solving-finding-all-words-within-a-larger-word-in-php/10096985#10096985. Ideally you either reduce the number of operations per word as in this example, or you reduce the search space by indexing or partitioning out unnecessary words (see the prefilter sketch after these comments). – mellamokb Apr 18 '12 at 01:19
  • @mellamokb that link deals with internal matches but does not score partial. – paparazzo Apr 18 '12 at 02:08
  • @DBM Taking the 2 seconds down hopefully defines the question. Thanks. – paparazzo Apr 18 '12 at 02:24
  • since `string` implements `IEnumerable`, your innermost loop could be `foreach(char wc in ftsw.Word)`, eliminating your need for the `char[] word` use altogether. Also, note that `char.ToLower` goes through current-culture-based conversions. – devgeezer Apr 18 '12 at 07:19
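
Following up on the comment above about reducing the search space: one possibility (a hypothetical sketch, not something proposed in the thread) is to precompute a 26-bit letter mask for each of the 600,000 words at load time and skip words that cannot possibly reach minMatch. The names Prefilter, LetterMask and SharedLetters are invented for this sketch, and the shortcut only holds while the search term has no repeated letters:

    static class Prefilter
    {
        // 26-bit mask of which letters a word contains (computed once per word at load time).
        public static int LetterMask(string s)
        {
            int mask = 0;
            foreach (char c in s)
            {
                char lc = char.ToLowerInvariant(c);
                if (lc >= 'a' && lc <= 'z')
                    mask |= 1 << (lc - 'a');
            }
            return mask;
        }

        // Number of distinct letters two masks have in common.
        public static int SharedLetters(int a, int b)
        {
            int x = a & b, n = 0;
            while (x != 0) { x &= x - 1; n++; }   // clear lowest set bit, count it
            return n;
        }
    }

    // Usage sketch inside the search loop (wordMask stored alongside each word):
    // int findMask = Prefilter.LetterMask("find");
    // if (Prefilter.SharedLetters(findMask, wordMask) < minMatch) continue;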

2 Answers


Your core algorithm is currently O(n^2) since you have two nested loops looking for matching characters. You can easily make that part O(n) by using a Dictionary that holds the character count for each character in the find string:

string find = "find";
var findMap = new Dictionary<char, int>();
foreach (char c in find)
{
    if (findMap.ContainsKey(c))
    {
        findMap[c] = findMap[c] + 1;
    }
    else
        findMap.Add(c, 1);
}
//findMap is pre-generated once

string word = "pint";
int count = 0;

//runs for each word in list, now in O(n)
foreach(char c in word)
{
    int charCount;
    if(findMap.TryGetValue(c, out charCount))
    {
        if(charCount > 0)
        {
            charCount--;
            findMap[c] = charCount;
            count++;
        }
    }
}
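
To slot this into the original loop, the per-word block above can be wrapped in a small helper; the method name CountMatches and the call site below are additions for this sketch, not part of the answer as posted (requires System.Collections.Generic):

    // Hypothetical helper wrapping the per-word count shown above.
    static int CountMatches(Dictionary<char, int> findMap, string word)
    {
        // Copy the pre-built map so every word starts with the full
        // character budget of the search term.
        var remaining = new Dictionary<char, int>(findMap);
        int count = 0;
        foreach (char c in word)
        {
            int left;
            if (remaining.TryGetValue(c, out left) && left > 0)
            {
                remaining[c] = left - 1;
                count++;
            }
        }
        return count;
    }

    // Call site, replacing the nested character loops in the question:
    // if (ftsw.Word.Length < maxFrom && CountMatches(findMap, ftsw.Word) >= minMatch)
    //     matchWords.Add(ftsw);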
BrokenGlass
  • Comparing O( ) on such a small 'n' is misleading / useless. Short of profiling there is no way to prove this is any faster than the approach in the question. It might be, or it might not be. – Ian Mercer Apr 18 '12 at 04:04
  • This cut the time in half, and it accounts for multiple matches of the same character. What it misses is case insensitivity. Thanks! I was hoping for some LINQ magic in milliseconds, but this will do. – paparazzo Apr 18 '12 at 14:15

You can remove the char.ToLower() on fc if you ensure it's lower-cased before you start.

You could also try using IndexOf() to find the first (and then subsequent) occurrences of the character, as the BCL implementation may internally be faster than what you can manage with your own loop.
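
A rough sketch of that idea (my reading of the suggestion, not code from the answer; it assumes both strings are already lower-cased and requires System.Collections.Generic):

    static int CountWithIndexOf(string word, string find)
    {
        // Per-character resume point, so a repeated character in the search
        // term ("findd") only gets credit for a second, distinct occurrence.
        var nextStart = new Dictionary<char, int>();
        int count = 0;
        foreach (char fc in find)
        {
            int from;
            nextStart.TryGetValue(fc, out from);   // defaults to 0
            int pos = word.IndexOf(fc, from);      // BCL scan instead of a hand-rolled loop
            if (pos >= 0)
            {
                count++;
                nextStart[fc] = pos + 1;           // resume after this hit next time
            }
        }
        return count;
    }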

You could also try running your loops in reverse which can provide a speedup:

 for (int i = arr.Length - 1; i >= 0; i--)

But really, for OCR why would you sum up matching characters from arbitrary positions in the string instead of doing a true edit distance like Damerau-Levenshtein?
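
For reference, here is a compact sketch of the restricted (optimal string alignment) form of Damerau-Levenshtein. This is the textbook algorithm rather than code from the thread, so treat it as illustrative; a lower distance means a closer match (requires System for Math.Min):

    static int DamerauLevenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insertions

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,          // deletion
                    d[i, j - 1] + 1),         // insertion
                    d[i - 1, j - 1] + cost);  // substitution

                // Adjacent transposition ("teh" vs "the")
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // e.g. DamerauLevenshtein("findd", "find") == 1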

Ian Mercer
  • The edit distance link is very interesting. In OCR we also get two words concatenated, and it does not deal with that, but I might be smart enough to modify it. – paparazzo Apr 18 '12 at 13:03
  • I do need to use something more advanced, as a simple hit count is not selective enough. For example, it matches homeplate with 6 characters. – paparazzo Apr 18 '12 at 14:26
  • That algorithm ran in 6 seconds against the 600,000. Slower, but it yields a meaningful answer. Like I said before, it does not deal with OCR concatenations, but it was not designed for that. Thanks. – paparazzo Apr 18 '12 at 14:59