
I've built an application that uses Tesseract (v3.03rc1) to identify some specific text strings. These are, unfortunately, printed in a custom font, which requires me to build my own traineddata file. I've implemented the app on both iOS (using https://github.com/gali8/Tesseract-OCR-iOS for inspiration) and Android (using https://github.com/rmtheis/tess-two/ for inspiration as well).

The workflow for both platforms is as follows:

  • I select a bounding box on the preview screen around the relevant text and crop the image accordingly.

  • I use OpenCV to get a binary image, calling OpenCV's adaptive threshold function with the same parameters on both platforms (a rough sketch of the Android side follows this list).

  • I pass this binary image to Tesseract. Both platforms (Android and iOS) use the same traineddata file.
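For concreteness, the Android binarization step looks roughly like this. It's a sketch rather than my exact code: the class and method names are placeholders, and blockSize/C stand in for the adaptive-threshold parameters I keep identical across the two platforms.

    import org.opencv.android.Utils;
    import org.opencv.core.Mat;
    import org.opencv.imgproc.Imgproc;

    import android.graphics.Bitmap;

    public class Binarizer {
        // Adaptively threshold the cropped preview frame.
        // blockSize must be odd and > 1; C is the constant subtracted from the
        // neighbourhood value (Gaussian variant shown; the mean variant works the same way).
        public static Bitmap binarize(Bitmap cropped, int blockSize, double C) {
            Mat rgba = new Mat();
            Utils.bitmapToMat(cropped, rgba);              // Bitmap -> RGBA Mat

            Mat gray = new Mat();
            Imgproc.cvtColor(rgba, gray, Imgproc.COLOR_RGBA2GRAY);

            Mat binary = new Mat();
            Imgproc.adaptiveThreshold(gray, binary, 255,
                    Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C,
                    Imgproc.THRESH_BINARY, blockSize, C);

            Bitmap out = Bitmap.createBitmap(cropped.getWidth(),
                    cropped.getHeight(), Bitmap.Config.ARGB_8888);
            Utils.matToBitmap(binary, out);                // single-channel Mat -> Bitmap for Tesseract
            return out;
        }
    }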

And yet, iOS recognizes the text strings perfectly, while Android keeps misidentifying certain characters (6s for Ss, As for Hs).

On both platforms I use the same whitelist string, disable load_type_dawg and load_system_dawg, and choose to save the blob choices.
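On Android those settings go through tess-two's TessBaseAPI, roughly as below. Again, a sketch: the "customfont" language name and the whitelist string are placeholders for my actual values.

    import com.googlecode.tesseract.android.TessBaseAPI;

    import android.graphics.Bitmap;

    public class Recognizer {
        private static final String WHITELIST = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // placeholder

        // dataPath must be the parent of a "tessdata" folder that contains
        // customfont.traineddata.
        public static String recognize(Bitmap binaryImage, String dataPath) {
            TessBaseAPI baseApi = new TessBaseAPI();
            baseApi.init(dataPath, "customfont");

            // Same settings I apply on iOS:
            baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, WHITELIST);
            baseApi.setVariable("load_system_dawg", "F");
            baseApi.setVariable("load_type_dawg", "F");
            baseApi.setVariable("save_blob_choices", "T");

            baseApi.setImage(binaryImage);
            String result = baseApi.getUTF8Text();
            baseApi.end();
            return result;
        }
    }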

Has anyone encountered this kind of situation before? Am I missing a setting on Android that's automatically handled in iOS? Is there something particular about Android that hasn't crossed my mind?

Any thoughts or advice would be greatly appreciated!

dedual
  • Have you tried removing uncertainty coming from the device cameras? If you take a still photo of what you are trying to OCR, and give that still photo to step 2 (OpenCV) on both platforms, do you still get a discrepancy? I'm trying to understand if the problem is coming from Tesseract or from higher levels in your stack. – Robin Eisenberg May 29 '15 at 15:19
  • Hi Robin! Just tested this theory out. Took the binarized image from iOS and used it as input on Android (bypassing OpenCV and just going straight to Tesseract) and, funny enough, it's not recognizing the image. Yet, on iOS, it does. – dedual May 29 '15 at 15:54
  • Is it not recognising the format, or is it having trouble with the actual OCR? Also, are you using wrappers around Tesseract for iOS and Android? (You mentioned tess-two for 'inspiration', but are you actually doing the wrapping around Tesseract yourself?) – Robin Eisenberg May 29 '15 at 15:56
  • It's having trouble with the actual OCR on Android. Yes, I'm using wrappers both on iOS and Android. Specifically on Android I'm using Tess-Two's as a starting point and deviating from it so as to fit some of our needs. – dedual May 29 '15 at 16:08
  • The underlying C++ code from Tesseract should be the same, so I'm guessing one of your wrappers forces an OCR setting that changes the recognition algorithm. You need to look deep into both wrappers, and make sure the Android one is passing the same recognition parameters to Tesseract as your iOS one. – Robin Eisenberg May 29 '15 at 16:15
  • I suspected as much. I'll start first by looking into the Bitmap to Pixa wrapper, see how they deviate. Thanks for your help, Robin! Will report my findings! – dedual May 29 '15 at 16:18
  • What might also point you in the right direction is using ndk-gdb to set a breakpoint on the C++ call to Tesseract. You can check that the raw data being passed is the same between the two platforms. If you use the same image on both platforms, diff-ing this function call could point you to the answer. Sorry I couldn't be of more help, and good luck. If you find the answer don't forget to post it, I'm really curious to know what happened here :) – Robin Eisenberg May 29 '15 at 16:35

1 Answer


So, after a lot of work, I found out what was wrong with my Android application (thankfully, it wasn't an issue with Tesseract at all). As I'm more familiar with iOS than with Android, I wasn't sure how to bundle the traineddata file with the application without requiring the user to have the file on their device's external storage. I found inspiration in this project (http://www.codeproject.com/Tips/840623/Android-Character-Recognition), which autoloads the traineddata file.

However, I misunderstood how it worked. I originally thought the TessDataManager did a file lookup in the project's local tesseract/tessdata folder to get the traineddata file (as I do on iOS). That's not what it does. Instead, it checks the app's internal storage (data/data/projectname/files/tesseract/tessdata/traineddatafilegoeshere) to see whether the file already exists; if it doesn't, it copies over the traineddata file it keeps in the Resources/Raw directory. In my case, it defaulted to the eng file, so it never read my custom font's traineddata.
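A rough sketch of what the fix boils down to, assuming the custom traineddata is bundled as a raw resource (R.raw.customfont here refers to the app's generated resource class, and "customfont" is a placeholder for my actual language/file name):

    import com.googlecode.tesseract.android.TessBaseAPI;

    import android.content.Context;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class TessSetup {
        // Copies the bundled traineddata into internal storage
        // (/data/data/<package>/files/tesseract/tessdata/) on first run,
        // then points Tesseract at it.
        public static TessBaseAPI initTesseract(Context context) throws IOException {
            File tessDir = new File(context.getFilesDir(), "tesseract");
            File tessdataDir = new File(tessDir, "tessdata");
            if (!tessdataDir.exists() && !tessdataDir.mkdirs()) {
                throw new IOException("Could not create " + tessdataDir);
            }

            File trainedData = new File(tessdataDir, "customfont.traineddata");
            if (!trainedData.exists()) {
                // This is the copy step I was missing: without it, my custom
                // traineddata never reaches the path Tesseract reads from.
                InputStream in = context.getResources().openRawResource(R.raw.customfont);
                OutputStream out = new FileOutputStream(trainedData);
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                out.close();
                in.close();
            }

            TessBaseAPI baseApi = new TessBaseAPI();
            // init() expects the parent of the "tessdata" directory.
            baseApi.init(tessDir.getAbsolutePath(), "customfont");
            return baseApi;
        }
    }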

Hopefully this helps someone else having similar issues. Thanks to Robin and RmTheis for all of your help!

dedual