
I am working on a program to extract Chinese hard-coded subtitles from videos. The final step of my program uses the JavaCPP Presets for Tesseract library for OCR.

TessBaseAPI api = new TessBaseAPI();

// Initialize tesseract-ocr with Simplified Chinese
if (api.Init(pathToTessdata, "chi_sim") != 0) {
    System.err.println("Could not initialize tesseract.");
    System.exit(1);
}

// Open input image with leptonica library
PIX image = pixRead(pathToInputImage);
api.SetImage(image);

// Get OCR result
BytePointer outText = api.GetUTF8Text();
String result = outText.getString();

Character recognition works pretty well, but Tesseract occasionally inserts spaces where the image has none. Consider the following picture:

The correct output would be 这么快就到了, but Tesseract renders this as 这么快就到 了. I believe this is because the character 了 - apart from the top horizontal line - consists of just one vertical line with lots of empty space to the left.

I found two people who already faced the same problem using Tesseract with C++: Tesseract False Space Recognition and How to keep Tesseract from inserting extra whitespace in words?. The suggested solutions change the setting tosp_min_sane_kn_sp. However, I was not able to get that working from Java.

A Google search for javacpp tosp_min_sane_kn_sp doesn't turn up anything, either. I tried api.SetVariable("tosp_min_sane_kn_sp", "2.8"); but without success.

I also tried to declare that my font is monospace, as Chinese characters are by definition monospace, but couldn't figure out how to do that. Also, if numbers appear in a subtitle, they may not be monospace.
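To illustrate the kind of spacing decision I have in mind with the monospace idea, here is a minimal sketch in plain Java. It is not something Tesseract offers as far as I know: the Glyph type, the rebuild method and the ratio of 1.4 are all made up for illustration, and I assume the horizontal boxes of the recognized symbols could be obtained from the OCR result somehow (that part is not shown). Instead of the gap between ink boxes, it compares the distance between glyph centers to the typical character pitch, so a narrow character like 了 would not count as being preceded by a space.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PitchSpacing {

    /** Hypothetical container: one recognized symbol and its horizontal extent in pixels. */
    static final class Glyph {
        final String text;
        final int left;
        final int right;

        Glyph(String text, int left, int right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }

        double center() {
            return (left + right) / 2.0;
        }
    }

    /**
     * Rebuilds a line of text, inserting a space only where the distance between
     * neighbouring glyph centers is at least spaceRatio times the typical pitch.
     * Assumes monospaced glyphs ordered left to right; digits or Latin letters
     * may violate that assumption.
     */
    static String rebuild(List<Glyph> glyphs, double spaceRatio) {
        if (glyphs.isEmpty()) {
            return "";
        }
        if (glyphs.size() == 1) {
            return glyphs.get(0).text;
        }

        // Center-to-center distance between each glyph and its predecessor.
        List<Double> steps = new ArrayList<>();
        for (int i = 1; i < glyphs.size(); i++) {
            steps.add(glyphs.get(i).center() - glyphs.get(i - 1).center());
        }

        // Lower median of the distances serves as the estimated character pitch.
        List<Double> sorted = new ArrayList<>(steps);
        Collections.sort(sorted);
        double pitch = sorted.get((sorted.size() - 1) / 2);

        StringBuilder out = new StringBuilder(glyphs.get(0).text);
        for (int i = 1; i < glyphs.size(); i++) {
            if (steps.get(i - 1) >= spaceRatio * pitch) {
                out.append(' ');
            }
            out.append(glyphs.get(i).text);
        }
        return out.toString();
    }
}
```

With made-up boxes on a 40-pixel grid, a narrow 了 centered in its cell keeps the same center distance as its neighbours, while a skipped cell doubles it. The ratio would need tuning on real subtitles.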

So my question: How do I change Tesseract's sensitivity to spaces from Java?

By the way, simply removing all spaces is not an option, as some subtitles do contain spaces, such as between the first and second character in this text:

If it helps, spaces are generally as wide as the one in this picture. Any help is greatly appreciated.

Alexander Jank