1

We are trying to use Tesseract with Tess4j for OCR text extraction.

With continuous use of Tesseract over a period of time, we notice the RAM used by the application gradually increasing. During this time, the heap memory is still free. We monitored the off-heap memory using jconsole, and it also seems normal. But the RSS (resident memory) of the application keeps increasing.

My guess is that Tesseract leaks memory while allocating it for OCR, but I'm not sure. Any ideas on how to investigate further would be appreciated.
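For reference, one way to track the gap between heap and RSS is to log both from inside the application. The following is only a minimal sketch, assuming a Linux host where `/proc/self/status` is available; the class name `MemorySnapshot` and its helper methods are hypothetical and not part of Tess4j or Tesseract.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: prints JVM heap usage next to the process RSS.
class MemorySnapshot {

    /** Extracts the VmRSS value in kB from the lines of /proc/<pid>/status, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** Heap currently in use by the JVM, in kB. */
    static long usedHeapKb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / 1024;
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status"); // assumption: Linux only
        long rssKb = Files.exists(status) ? parseVmRssKb(Files.readAllLines(status)) : -1;
        System.out.println("heapUsedKb=" + usedHeapKb() + " rssKb=" + rssKb);
    }
}
```

Logging these two numbers after each batch of OCR calls would show whether RSS grows while the heap stays flat, which is the pattern described above.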


aravinth
  • In Python we saw similar effects but essentially decided that this is not a leak, although Tesseract appears to consume more and more. When Python (or in your case the JVM) decides to free memory is up to the specific implementation, not "the task is done, free memory now". Does your application crash due to memory limits? – TeddybearCrisis Jun 26 '20 at 11:39
  • I have the same issue on an Ubuntu server with Python; in my case disk space eventually ran out. Please help. – sumesh shetty Jan 15 '21 at 07:17
  • Hey aravinth. Were you able to fix this issue? – Iana Mykhailenko Mar 09 '21 at 10:05
  • @IanaMykhailenko Sorry, we couldn't, but the issue stopped when we moved to physical machines instead of VMs. – aravinth Mar 12 '21 at 10:40

2 Answers

1

I had the same issue for the last few days. I resolved it by removing tess4j and using Tika 1.27 + tesseract. I used an ExecutorService to run 3 threads at a time, which kept memory within limits.

    byte[] fileBytes; // image bytes
    Future<String> future = executorService.submit(() -> {
        // Configure Tesseract through Tika
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setLanguage("kor+eng");
        config.setEnableImageProcessing(1);
        config.setPreserveInterwordSpacing(true);
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);

        // Parse the image and collect the extracted text
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        parser.parse(new ByteArrayInputStream(fileBytes), handler, metadata, context);
        return handler.toString();
    });

    // Wait at most 120 seconds for the OCR result
    fileBody = future.get(120, TimeUnit.SECONDS);
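The snippet above does not show how `executorService` is created. A minimal sketch of a pool limited to 3 threads could look like the following; the class `BoundedOcrRunner` and the method `runAll` are hypothetical names used only for illustration, and the lambda in `main` is a stand-in for the real OCR call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs OCR-like tasks with at most poolSize of them active at once.
class BoundedOcrRunner {

    /** Runs every task on a fixed-size pool; a task that fails or times out yields null. */
    static List<String> runAll(List<Callable<String>> tasks, int poolSize, long timeoutSeconds)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> task : tasks) {
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    // The timeout is counted from this get() call, not from task start
                    results.add(future.get(timeoutSeconds, TimeUnit.SECONDS));
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true);
                    results.add(null);
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int page = i;
            tasks.add(() -> "text of page " + page); // stand-in for the real OCR call
        }
        System.out.println(runAll(tasks, 3, 120));
    }
}
```

With a fixed pool, extra images simply queue up instead of starting more concurrent OCR work, which is presumably what kept memory bounded here.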

While the code above works, I later simplified it by spawning a process that calls tesseract directly.

    protected String doOcr(byte[] fileBytes, int timeout, String language) {
        String text = null;
        File inputFile = null;
        File outputFile = null;
        try {
            // Write the image to a temp file; tesseract appends ".txt" to the output base path
            inputFile = File.createTempFile("tesseract-input", ".png");
            String outputPath = inputFile.getAbsolutePath() + "-output";
            outputFile = new File(outputPath + ".txt");
            try (FileOutputStream fos = new FileOutputStream(inputFile)) {
                fos.write(fileBytes);
            }

            String[] commandCreate = { "tesseract", inputFile.getAbsolutePath(), outputPath,
                    "-l", language, "--psm", "1", "-c", "preserve_interword_spaces=1" };

            runCommand(commandCreate, timeout);
            if (outputFile.exists()) {
                try (FileInputStream fis = new FileInputStream(outputFile)) {
                    text = IOUtils.toString(fis, Constants.UTF_8);
                }
            }
        } catch (InterruptedException e) {
            // Thrown when this thread is interrupted while waiting for tesseract
            Thread.currentThread().interrupt();
            logger.warn("interrupted while trying to read image file body");
        } catch (Exception e) {
            logger.error(String.format("Cannot read image file body, error : %s", e.getMessage()), e);
        } finally {
            // Always clean up the temp files
            if (null != inputFile && inputFile.exists()) {
                inputFile.delete();
            }
            if (null != outputFile && outputFile.exists()) {
                outputFile.delete();
            }
        }
        return text;
    }

    protected void runCommand(String[] command, int timeout) throws IOException, InterruptedException {
        logger.info("command : " + StringUtils.join(command, " "));
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO();
        // By default tesseract uses 4 threads per image; limit it to 1
        builder.environment().put("OMP_THREAD_LIMIT", "1");
        Process p = builder.start();
        boolean finished = p.waitFor(timeout, TimeUnit.SECONDS);
        if (!finished) {
            // Timed out: kill the process so it does not keep running
            logger.warn("task not finished");
            p.destroyForcibly();
        }
    }

jkb016
  • Thanks for this info. Can this handle text style properties like font properties? – ken4ward Dec 29 '21 at 10:57
  • Tesseract just converts to text. You can test your font images on the command line. If it works on the command line, it will work in Java also. If you have to pass any extra command-line parameters, see if you can find them in TesseractOCRConfig. – jkb016 Dec 29 '21 at 18:25
0

For those who are stuck and don't want to change their code or Maven library: I solved it by setting my Tesseract reader object to null after reading and then requesting a garbage collection with System.gc(). Example:

    TessReader reader = new TessReader(); // custom class executing doOCR()
    String content = reader.getContent();
    reader = null; // drop the reference so the reader can be collected
    System.gc();