
Below is a function that uses TessNet2 (an OCR framework) to scan through a list of words captured by TessNet2's built-in OCR. Since the pages I'm scanning are of less than perfect quality, the detection of the words is not 100% accurate.

So sometimes it will confuse an 'S' with a '5' or an 'l' with a '1'. It also doesn't take capitalization into account, so I have to search for both cases.

The way it works is that I search for certain words that are close to each other on the paper. The first set of words [i] is "Abstracting Service Ordered". If the page contains those words next to each other, then it moves to the next set of words [j], and then the next [h]. If the page contains all 3 sets of words, then it returns true.

This is the best method I've thought of, but I'm hoping someone here can give me another way to try.

public Boolean isPageABSTRACTING(List<tessnet2.Word> wordList)
    {

        for (int i = 0; i + 2 < wordList.Count; i++) //scan through words (stop 2 short so wordList[i + 2] stays in range)
        {
            if ((wordList[i].Text == "Abstracting" || wordList[i].Text == "abstracting" || wordList[i].Text == "abstractmg" || wordList[i].Text == "Abstractmg") && wordList[i].Confidence >= 50 && (wordList[i + 1].Text == "Service" || wordList[i + 1].Text == "service" || wordList[i + 1].Text == "5ervice") && wordList[i + 1].Confidence >= 50 && (wordList[i + 2].Text == "Ordered" || wordList[i + 2].Text == "ordered") && wordList[i + 2].Confidence >= 50) //find 1st tier check (note: && binds tighter than ||, so the || alternatives must be parenthesized or the confidence check only applies to the last one)
            {
                for (int j = 0; j + 2 < wordList.Count; j++) //scan through words again
                {
                    if ((wordList[j].Text == "Due" || wordList[j].Text == "Oue") && wordList[j].Confidence >= 50 && (wordList[j + 1].Text == "Date" || wordList[j + 1].Text == "Oate") && wordList[j + 1].Confidence >= 50 && wordList[j + 2].Text == "&" && wordList[j + 2].Confidence >= 50) //find 2nd tier check
                    {
                        for (int h = 0; h + 3 < wordList.Count; h++) //scan through words again
                        {
                            if ((wordList[h].Text == "Additional" || wordList[h].Text == "additional") && wordList[h].Confidence >= 50 && (wordList[h + 1].Text == "comments" || wordList[h + 1].Text == "Comments") && wordList[h + 1].Confidence >= 50 && (wordList[h + 2].Text == "about" || wordList[h + 2].Text == "About") && wordList[h + 2].Confidence >= 50 && (wordList[h + 3].Text == "this" || wordList[h + 3].Text == "This") && wordList[h + 3].Confidence >= 50) //find 3rd tier check
                            {
                                return true;
                            }
                        }
                    }
                }
            }
        }

        return false;
    }
MaylorTaylor

2 Answers


Firstly, there's no need for the redundant nested loops: no inner loop depends on anything from the outer loop, so there's no reason to pay the huge performance penalty of looping over the words N^3 times (as opposed to 3N).
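To illustrate the first point, here's a sketch of the same checks done as three independent, sequential scans (3N instead of N^3). The `ContainsPhrase` helper and the `Word` class are hypothetical stand-ins for tessnet2's types, assuming `Text` and `Confidence` members as in your code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for tessnet2.Word (assumed shape from the question).
class Word { public string Text; public int Confidence; }

static class PageCheck
{
    // True if some position starts a consecutive run where each word
    // matches one of its allowed spellings with sufficient confidence.
    static bool ContainsPhrase(IList<Word> words, string[][] phrase)
    {
        for (int i = 0; i + phrase.Length <= words.Count; i++)
        {
            bool match = true;
            for (int k = 0; k < phrase.Length; k++)
            {
                var w = words[i + k];
                if (w.Confidence < 50 || !phrase[k].Contains(w.Text))
                {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    public static bool IsPageAbstracting(IList<Word> words) =>
        ContainsPhrase(words, new[] {
            new[] { "Abstracting", "abstracting", "abstractmg", "Abstractmg" },
            new[] { "Service", "service", "5ervice" },
            new[] { "Ordered", "ordered" } })
        && ContainsPhrase(words, new[] {
            new[] { "Due", "Oue" },
            new[] { "Date", "Oate" },
            new[] { "&" } })
        && ContainsPhrase(words, new[] {
            new[] { "Additional", "additional" },
            new[] { "comments", "Comments" },
            new[] { "about", "About" },
            new[] { "this", "This" } });
}
```

Each phrase is a separate scan, and a failed early phrase short-circuits the later ones, so worst case is 3 passes over the list rather than 3 nested ones.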

Secondly, I think there are definitely more elegant approaches (like using a dictionary of words and calculating the best match for words that aren't in the dictionary, or other more dynamic approaches), but they would involve more complicated algorithms. A simple equivalent can be done using regular expressions:

// combine all the words into 1 string separated by a space
// where the confidence is high enough
// use a word that the regex's won't match for words where the confidence
// isn't high enough
var text = wordList.Select(w => w.Confidence >= 50 ? w.Text : "DONTMATCH")
           .Aggregate((x,y) => x + " " + y);

// now run the text through regular expressions 
// to match each criteria allowing for case insensitivity
// and known misidentifications
if (!Regex.IsMatch(text, @"abstract(in|m)g\s+(s|5)ervice\s+ordered", RegexOptions.IgnoreCase))
    return false;

if (!Regex.IsMatch(text, @"(d|o)ue\s+(d|o)ate\s+&", RegexOptions.IgnoreCase))
    return false;

if (!Regex.IsMatch(text, @"additional\s+comments\s+about\s+this", RegexOptions.IgnoreCase))
    return false;
return true;

Since your algorithm is only interested in a few specific phrases, and you don't want it to match when the confidence for a word is too low, we can easily combine all the words into one long string separated by spaces (for convenience). Then we construct regular expressions covering the 3 phrases of interest with the known alternatives, and simply test the concatenated string against them.

It is obviously only going to cater for this very specific case, though...

Martin Ernst
  • Do you mind explaining a bit of the code. When I run this, 'text' becomes a string of all the text in the OCR results but a few words have "DONTMATCH". I'm assuming this is because of the confidence not being greater than 50. – MaylorTaylor Jul 11 '13 at 16:39

You can try using a vocabulary of the words you expect, and finding the word closest to each recognized word by Levenshtein distance.
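A minimal sketch of that idea, using the classic dynamic-programming Levenshtein distance. `FindClosest` is a hypothetical helper name, and the vocabulary would come from the phrases the page is expected to contain:

```csharp
using System;
using System.Linq;

static class FuzzyMatch
{
    // Edit distance between two strings, compared case-insensitively
    // (two rows of the DP table instead of the full matrix).
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = char.ToLowerInvariant(a[i - 1]) == char.ToLowerInvariant(b[j - 1]) ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Hypothetical helper: pick the vocabulary word with the smallest distance.
    public static string FindClosest(string word, string[] vocabulary) =>
        vocabulary.OrderBy(v => Levenshtein(word, v)).First();
}
```

With this, an OCR misread like "5ervice" would map to "Service" (distance 1), so the page check no longer needs to enumerate every misspelling by hand; you could also reject a match when even the closest word is too far away.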

SkyterX