
My application involves scanning text through the phone camera and detecting words. The only words my application is concerned with are valid English words. I have a list of ~354,000 valid English words against which I can compare each scanned word.

Since my application continuously detects text, I need this functionality to be very fast. I have applied the Levenshtein distance technique. For each word, I:

  1. Store the contents of the text file in an `ArrayList<String>` using `Scanner`
  2. Calculate the Levenshtein distance between the word and each of the 354k words
  3. Return the word corresponding to the minimum distance value
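The linear scan described in the steps above can be sketched like this (an illustrative reconstruction, not the asker's actual code; the class and method names are mine). It uses the standard two-row dynamic-programming formulation, which keeps the per-pair cost at O(m*n) time and O(n) space:

```java
import java.util.List;

public class NearestWord {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                            prev[j - 1] + cost); // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Brute-force scan over the whole dictionary: the slow part
    // the question is about, shown here for reference.
    static String closest(String scanned, List<String> dictionary) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String w : dictionary) {
            int d = levenshtein(scanned, w);
            if (d < bestDist) {
                bestDist = d;
                best = w;
            }
        }
        return best;
    }
}
```

Note that the dictionary must be loaded once, up front, and reused across scanned words; re-reading the 354k-word file inside the per-word loop would dominate the runtime on its own.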

The problem is that it is very slow. Without this step, my app manages to OCR more than 20 words in around 70 to 100 milliseconds. When I include this fixing routine, my app takes more than a full minute (60,000 ms) for a single word.

I was wondering whether this technique is even suitable for my case. If not, what other proven approach should I go with? Any help would be greatly appreciated. I know this is possible, given how Android keyboards are able to instantly correct mistyped words.

Other failed endeavors:

  • Jaro distance (similarly slow).
  • Android's internal SpellCheckerSession service (doesn't fit my case; receiving results via a callback is the issue).
  • @Andy cutting down the list is not an option, unfortunately. Might there be any hashing or mapping technique that I could incorporate here? – Abdul Wasae Jul 15 '16 at 15:07
  • I'm not necessarily meaning "cutting down the list" as in completely discarding words; I mean that you need a way to partition the list such that you don't search words which it is impossible to match. For instance, I presume that you have some idea about how long the word will be - can you check only words of that length +/- 1, say? – Andy Turner Jul 15 '16 at 15:24
  • @Andy Say even if I somehow narrow the word list down to one tenth, that still means 6 seconds per word. Whereas smartphone keyboards are able to do it instantly. I would really like to know about that technique. – Abdul Wasae Jul 15 '16 at 15:26
  • It's a step in the right direction, isn't it? – Andy Turner Jul 15 '16 at 15:26
  • @AndyTurner, unfortunately in the context of OCR, it is a no go. OCR can get a character wrong. Additionally, it can miss a character entirely, in the sense that if 'I' is recognized as '1', I will have already stripped digits from the recognized words. Hence M1ster -> Mster -> will never be matched to Mister. – Abdul Wasae Jul 15 '16 at 15:29
  • You can't make a computer work faster; you can only have it do less stuff it doesn't need to. Unless you can make your distance calculation orders of magnitude faster, you've simply got to make it check fewer strings, by some means. – Andy Turner Jul 15 '16 at 15:32
  • @AndyTurner I totally understand. I do, however, strongly suspect that there is a whole other technique out there that would fit here better than the 'minimum edit distance' method. Would you by any chance know of some alternatives? – Abdul Wasae Jul 15 '16 at 15:34
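Andy Turner's partition-by-length suggestion from the comments could be sketched as follows (class name and API are illustrative, not from the thread). The justification is that a word within k edits of the scanned word can differ from it in length by at most k, so buckets outside that range can be skipped entirely; this also accommodates the OCR objection above, since a dropped character like the '1' in "M1ster" only shifts the length by one:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LengthIndex {
    private final Map<Integer, List<String>> byLength = new HashMap<>();

    // Build once at startup from the full 354k-word list.
    LengthIndex(List<String> words) {
        for (String w : words) {
            byLength.computeIfAbsent(w.length(), k -> new ArrayList<>()).add(w);
        }
    }

    // A word within maxDelta edits of `scanned` differs from it in
    // length by at most maxDelta, so only those buckets can match.
    List<String> candidates(String scanned, int maxDelta) {
        List<String> out = new ArrayList<>();
        for (int len = scanned.length() - maxDelta; len <= scanned.length() + maxDelta; len++) {
            out.addAll(byLength.getOrDefault(len, Collections.emptyList()));
        }
        return out;
    }
}
```

The Levenshtein scan then runs only over `candidates(scanned, maxDelta)` instead of the full list; with English word lengths clustering around 5 to 9 letters, this typically cuts the search space by a large constant factor, though as the comments note it does not change the asymptotics.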

1 Answer


My solution that works: I created a MySQL table and uploaded the list of valid English words into it. This solves all the problems addressed in the question.

Here is my Android Application for reference: Optical Dictionary & Vocabulary Teacher
