The TESSDATA_PREFIX environment variable is set to the parent folder of the tessdata folder of the command-line Tesseract 4.0.0 installation (C:\Program Files (x86)\Tesseract-OCR). The command-line Tesseract produces reasonable output in all four OCR Engine Modes.

Here is my code:

package tessTest;

import java.util.ArrayList;
import java.util.List;

import net.sourceforge.tess4j.*;
import net.sourceforge.tess4j.ITesseract.RenderedFormat;

public class MainClass {
    public static void main(String[] argv) {
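        ITesseract instance = new Tesseract1();

        // Render the OCR result as a PDF.
        List<RenderedFormat> formats = new ArrayList<RenderedFormat>();
        formats.add(RenderedFormat.PDF);

        try {
            instance.setPageSegMode(1);   // PSM 1: automatic page segmentation with OSD
            instance.setOcrEngineMode(2); // OEM 2: legacy Tesseract + LSTM combined
            instance.setTessVariable("textonly_pdf", "1");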
            instance.createDocuments("D:\\Documents\\Malverne.jpeg",
                                     "D:\\Documents\\testOCR", formats);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

With setOcrEngineMode(1) or setOcrEngineMode(0) it produces the PDF as expected. With setOcrEngineMode(2) or setOcrEngineMode(3) it fails with the following error:

Exception in thread "main" java.lang.Error: Invalid memory access
    at net.sourceforge.tess4j.TessAPI1.TessBaseAPIProcessPages(Native Method)
    at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:542)
    at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:517)
    at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:484)
    at tessTest.MainClass.main(MainClass.java:21)
Detected 358 diacritics
contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\ccutil\unicharset.h, line 513

It seems to be an issue with this particular image, since OEM 2 works fine on other images from tess4j 4.0.0. I am aware that preprocessing the image will probably help, but I am working on a project where many thousands of pictures, only some of which are similar to this one, will regularly have to be OCRed by users, so tailored preprocessing on a case-by-case basis is infeasible.
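
For a batch of that size, one possible stopgap (not a fix for the failed assert) is to retry a failing image with an engine mode that is known to work. The following is only a minimal sketch of that idea. `OemFallback`, `OcrJob` and `runWithFallback` are hypothetical names, not part of Tess4J, and the sketch assumes the failure keeps surfacing as a catchable `java.lang.Error` as in the trace above; a native assert could just as well terminate the JVM, in which case no Java-level fallback would help.

```java
import java.util.Arrays;
import java.util.List;

public class OemFallback {

    /** Hypothetical callback: runs one OCR job with the given engine mode. */
    @FunctionalInterface
    public interface OcrJob {
        void run(int engineMode) throws Exception;
    }

    /**
     * Tries each engine mode in order and returns the first one for which
     * the job completes without throwing, or -1 if every mode fails.
     */
    public static int runWithFallback(List<Integer> engineModes, OcrJob job) {
        for (int mode : engineModes) {
            try {
                job.run(mode);
                return mode;
            } catch (Exception | Error e) {
                // Catching Error is normally discouraged; it is done here only
                // because the native failure is reported as java.lang.Error.
                System.err.println("OEM " + mode + " failed: " + e);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in job that simulates the behaviour described above:
        // OEM 2 and 3 throw, OEM 0 and 1 succeed.
        OcrJob simulated = mode -> {
            if (mode >= 2) {
                throw new Error("Invalid memory access");
            }
        };
        int used = runWithFallback(Arrays.asList(2, 1), simulated);
        System.out.println("Completed with OEM " + used);
    }
}
```

With Tess4J, the job body would presumably create a fresh `Tesseract1` instance per attempt, call `setOcrEngineMode(mode)` on it and then `createDocuments(...)` as in the code above, since the state of an instance after a native failure is unknown.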

The image in question is this one: http://malvernetheatre.org/wp-content/uploads/2012/07/Malverne-Community-Theatre-Newspaper-Reviews-14-page-0.jpg

Any help would be greatly appreciated. Many thanks in advance.

Grada Gukovic
  • Have you tried setting the data path, for example instance.setDatapath("C://t"); where the tessdata folder is inside the t folder, like C:\t\tessdata? This worked for me. – Tinus Jackson Jan 24 '18 at 12:32
  • Further to the above, I found this: https://stackoverflow.com/a/47076001/4712391 – Tinus Jackson Jan 24 '18 at 12:40
  • It produces output in two of the four OEMs (and on many other pictures), so it definitely finds the tessdata. Two days after Nguyen answered the question you linked, he actually updated the Tesseract1 class to use System.getenv("TESSDATA_PREFIX") as the default datapath. – Grada Gukovic Feb 18 '18 at 09:14
