
I am using Tess4j to run Tesseract OCR, and I have been using the following code:

Code sample

During testing I wanted to exercise the catch clause, so I fed invalid input to Tesseract, which should result in a TesseractException. I managed to induce a TesseractException from the createDocuments() method. Here is the stack trace: Console Output

Note that the stack trace includes line 125 of doOcr(), which is inside the try-catch block. Even though the console shows a TesseractException being thrown, execution moves on to line 126 and returns true.

I use net.sourceforge.tess4j.Tesseract to start the OCR process. I also tried net.sourceforge.tess4j.Tesseract1, which produced the same red console output from Tess4j, but no TesseractException.

My question is: what am I doing wrong? I assume the issue is in my code, because a TesseractException is being thrown but my code is not catching it.

  • If line 125 `restInstance.createDocuments(...)` throws an exception, it is not possible for line 126 to be executed. You see the log with the stack trace on the console, so where is it coming from? Use a debugger and check which lines are being executed. Also, can you please show the import of `TesseractException`? Maybe there is more than one and you imported the wrong one? – Jaroslaw Pawlak Jul 24 '19 at 08:07
  • @JaroslawPawlak While debugging, the console outputs the TesseractException when I am at line 125. I also double-checked my imports, and this is the only TesseractException: net.sourceforge.tess4j.TesseractException; – Kristóf Horváth Jul 24 '19 at 08:09

2 Answers

1

Look at the source code of Tesseract.java:

@Override
public void createDocuments(String[] filenames, String[] outputbases, List<RenderedFormat> formats) throws TesseractException {
    if (filenames.length != outputbases.length) {
        throw new RuntimeException("The two arrays must match in length.");
    }

    init();
    setTessVariables();

    try {
        for (int i = 0; i < filenames.length; i++) {
            File workingTiffFile = null;
            try {
                String filename = filenames[i];

                // if PDF, convert to multi-page TIFF
                if (filename.toLowerCase().endsWith(".pdf")) {
                    workingTiffFile = PdfUtilities.convertPdf2Tiff(new File(filename));
                    filename = workingTiffFile.getPath();
                }

                TessResultRenderer renderer = createRenderers(outputbases[i], formats);
                createDocuments(filename, renderer);
                api.TessDeleteResultRenderer(renderer);
            } catch (Exception e) {
                // skip the problematic image file
                logger.error(e.getMessage(), e);
            } finally {
                if (workingTiffFile != null && workingTiffFile.exists()) {
                    workingTiffFile.delete();
                }
            }
        }
    } finally {
        dispose();
    }
}

/**
 * Creates documents.
 *
 * @param filename input file
 * @param renderer renderer
 * @throws TesseractException
 */
private void createDocuments(String filename, TessResultRenderer renderer) throws TesseractException {
    api.TessBaseAPISetInputName(handle, filename); //for reading a UNLV zone file
    int result = api.TessBaseAPIProcessPages(handle, filename, null, 0, renderer);

    if (result == ITessAPI.FALSE) {
        throw new TesseractException("Error during processing page.");
    }
}

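The exception is thrown at line 579, in the private createDocuments(String, TessResultRenderer) method. That method is called by the public method above, at line 551, inside a try-catch block whose catch body (line 555) only calls logger.error(e.getMessage(), e);. The exception is therefore logged and then discarded, so it never reaches your code.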
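The same effect can be reproduced without Tess4j. The following is a minimal sketch, not library code: SwallowingProcessor, SwallowDemo and the logged list are hypothetical names, and doOcr() only imitates the pattern described in the question (return true unless the catch block runs).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library class; this is NOT Tess4j code.
class SwallowingProcessor {
    // Collects the messages a real library would send to its logger.
    final List<String> logged = new ArrayList<>();

    // Public entry point: declares a checked exception but never lets it escape.
    void createDocuments(String[] filenames) throws Exception {
        for (String filename : filenames) {
            try {
                processOne(filename);
            } catch (Exception e) {
                // Same shape as the library's catch body: log and skip the file.
                logged.add(e.getMessage());
            }
        }
    }

    // Private worker: the only place where the exception is thrown.
    private void processOne(String filename) throws Exception {
        if (filename.isEmpty()) {
            throw new Exception("Error during processing page.");
        }
    }
}

class SwallowDemo {
    // Imitates the caller from the question: true unless the catch block runs.
    static boolean doOcr(SwallowingProcessor processor, String[] filenames) {
        try {
            processor.createDocuments(filenames);
            return true;
        } catch (Exception e) {
            return false; // never reached: the exception was already swallowed
        }
    }

    public static void main(String[] args) {
        SwallowingProcessor processor = new SwallowingProcessor();
        boolean ok = doOcr(processor, new String[] {""});
        System.out.println("doOcr returned " + ok + ", logged: " + processor.logged);
    }
}
```

The caller's catch block cannot run here, because the inner try-catch handles the exception before it leaves createDocuments(). The stack trace still appears only because the inner catch records it.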

Now the question is: what do you really want to achieve?

If you don't want to see this log, you can configure the logging backend behind slf4j so that it does not print logs from this library.

If you want to get the actual exception, that is not possible because the library swallows it. I am not familiar with the library, but looking at the code there doesn't seem to be any nice option: the method that throws the exception is private and is used in only one place, inside the try-catch block. However, the exception is thrown when api.TessBaseAPIProcessPages(...) returns ITessAPI.FALSE, and api has a getter. So you could get the api object, call TessBaseAPIProcessPages(...) yourself and check the result. This might not be ideal, as you would probably be processing every image twice. Another solution is to fork the source code and modify it yourself. You might also want to contact the author and ask for advice. You could even take it further and submit a pull request for them to approve and release.
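One further way to detect failures after the fact, offered only as a sketch, is to check whether the expected output files exist once createDocuments() returns. This rests on two assumptions that you should verify for your setup: that each renderer writes to the output base plus a dot and a format extension (for example "pdf"), and that a failed file leaves no output or an empty one. OutputCheck and findFailedOutputs are hypothetical names, not part of Tess4j.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Tess4j.
class OutputCheck {
    // Returns the output bases whose expected file is missing or empty.
    // ASSUMPTION: output is written to outputbase + "." + extension, and a
    // failed run leaves that file absent or zero-length.
    static List<String> findFailedOutputs(String[] outputbases, String extension) {
        List<String> failed = new ArrayList<>();
        for (String base : outputbases) {
            File expected = new File(base + "." + extension);
            if (!expected.isFile() || expected.length() == 0) {
                failed.add(base);
            }
        }
        return failed;
    }
}
```

After the createDocuments(...) call, something like OutputCheck.findFailedOutputs(outputbases, "pdf").isEmpty() could stand in for the success flag. If stale files from an earlier run may exist, delete them before processing so they are not mistaken for fresh output.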

Jaroslaw Pawlak
  • Okay, it catches the exception, so why throw it? Also, how can I work around this, or is there a fault-proof way of inducing a TesseractException that I can catch? – Kristóf Horváth Jul 24 '19 at 08:13
  • Thank you for the swift answer. I was afraid of this, but I guess that is life. I will try to contact the author; it is also likely he will find this post. I simply wanted to determine the success of the Tesseract OCR process by catching TesseractException, but the swallowed exception really gets in the way of that. – Kristóf Horváth Jul 24 '19 at 08:22
0

Add the pdf.ttf file to the tessdata path (tessdata/pdf.ttf).