TESSDATA_PREFIX is set to the parent folder of the tessdata folder of the command-line Tesseract 4.0.0 installation (C:\Program Files (x86)\Tesseract-OCR). The command-line tesseract produces reasonable output in all four OCR engine modes.
Here is my code:
package tessTest;

import java.util.ArrayList;
import java.util.List;
import net.sourceforge.tess4j.*;
import net.sourceforge.tess4j.ITesseract.RenderedFormat;

public class MainClass {
    public static void main(String[] argv) {
        ITesseract instance = new Tesseract1();
        List<RenderedFormat> formats = new ArrayList<RenderedFormat>();
        formats.add(RenderedFormat.PDF);
        try {
            // PSM 1: automatic page segmentation with orientation and script detection
            instance.setPageSegMode(1);
            // OEM 2: legacy Tesseract engine combined with LSTM
            instance.setOcrEngineMode(2);
            // Write only the invisible text layer to the PDF, without the image
            instance.setTessVariable("textonly_pdf", "1");
            instance.createDocuments("D:\\Documents\\Malverne.jpeg",
                    "D:\\Documents\\testOCR", formats);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
With setOcrEngineMode(0) or setOcrEngineMode(1) it produces the PDF as expected. With setOcrEngineMode(2) or setOcrEngineMode(3) it fails with the following error:
Exception in thread "main" java.lang.Error: Invalid memory access
at net.sourceforge.tess4j.TessAPI1.TessBaseAPIProcessPages(Native Method)
at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:542)
at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:517)
at net.sourceforge.tess4j.Tesseract1.createDocuments(Tesseract1.java:484)
at tessTest.MainClass.main(MainClass.java:21)
Detected 358 diacritics
contains_unichar_id(unichar_id):Error:Assert failed:in file
c:\projects\github\tesseract-ocr\ccutil\unicharset.h, line 513
It seems to be an issue with this particular image, since OEM 2 works fine on other images from tess4j 4.0.0. I am aware that preprocessing the image will probably help, but I am working on a project where many thousands of pictures, only some of which are similar to this one, will regularly have to be OCRed by users, so tailored preprocessing on a case-by-case basis is infeasible.
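As a stopgap for the batch case, one option I am considering is to retry a failing image with an engine mode that does work. The following is only a sketch of that idea: OcrAttempt and firstWorkingMode are hypothetical names, not part of tess4j, and the simulated attempt in main stands in for the real setOcrEngineMode/createDocuments calls. It also assumes the JVM stays usable after the native failure, which the java.lang.Error above suggests but does not guarantee.

```java
import java.util.Arrays;
import java.util.List;

public class OemFallback {

    /** Hypothetical stand-in for one OCR attempt; not part of tess4j. */
    @FunctionalInterface
    public interface OcrAttempt {
        void run(int engineMode) throws Exception;
    }

    /**
     * Tries each engine mode in the given order and returns the first one
     * that completes without throwing, or -1 if every mode fails.
     */
    public static int firstWorkingMode(List<Integer> modes, OcrAttempt attempt) {
        for (int mode : modes) {
            try {
                attempt.run(mode);
                return mode;
            } catch (Exception | Error e) {
                // The native failure surfaced as java.lang.Error, so catching
                // Exception alone would miss it. Note that this also swallows
                // unrelated Errors, which is a deliberate simplification here.
                System.err.println("OEM " + mode + " failed: " + e);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated attempt: modes 2 and 3 fail, as they do for this image.
        // In the real program this lambda would call setOcrEngineMode(mode)
        // followed by createDocuments(...).
        OcrAttempt simulated = mode -> {
            if (mode >= 2) {
                throw new Error("Invalid memory access");
            }
        };

        // Prefer the combined/default modes, then fall back to 1 and 0.
        int used = firstWorkingMode(Arrays.asList(2, 3, 1, 0), simulated);
        System.out.println("Used OEM " + used); // prints "Used OEM 1"
    }
}
```

This would only hide the problem for affected images rather than fix it, and the output quality of the fallback mode may differ, so I would still like to understand the underlying cause.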
The image in question is this one: http://malvernetheatre.org/wp-content/uploads/2012/07/Malverne-Community-Theatre-Newspaper-Reviews-14-page-0.jpg
Any help would be greatly appreciated. Many thanks in advance.