
I am working on a Spring MVC application that uses Tesseract (through Tess4J) for OCR. When I pass an image file to doOCR, I get an IndexOutOfBoundsException. Any ideas?

Error log:

net.sourceforge.tess4j.TesseractException: java.lang.IndexOutOfBoundsException
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:215)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:196)
    at com.tooltank.spring.service.GroupAttachmentsServiceImpl.testOcr(GroupAttachmentsServiceImpl.java:839)
    at com.tooltank.spring.service.GroupAttachmentsServiceImpl.lambda$addAttachment$0(GroupAttachmentsServiceImpl.java:447)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException
    at javax.imageio.stream.FileCacheImageOutputStream.seek(FileCacheImageOutputStream.java:170)
    at net.sourceforge.tess4j.util.ImageIOHelper.getImageByteBuffer(ImageIOHelper.java:297)
    at net.sourceforge.tess4j.Tesseract.setImage(Tesseract.java:397)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:290)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:212)
    ... 4 more

Code:

 private String testOcr(String fileLocation, int attachId) {
        try {
            File imageFile = new File(fileLocation);
            BufferedImage img = ImageIO.read(imageFile);
            BufferedImage blackNWhite = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
            Graphics2D graphics = blackNWhite.createGraphics();
            graphics.drawImage(img, 0, 0, null);
            // Release the graphics context once the image has been drawn
            graphics.dispose();
            String identifier = new BigInteger(130, random).toString(32);
            String blackAndWhiteImage = previewPath + identifier + ".png";
            File outputfile = new File(blackAndWhiteImage);
            ImageIO.write(blackNWhite, "png", outputfile);

            ITesseract instance = new Tesseract();
            // Point to one folder above tessdata directory, must contain training data
            instance.setDatapath("/usr/share/tesseract-ocr/");
            // ISO 639-3 language code
            instance.setLanguage("deu");
            String result = instance.doOCR(outputfile);
            result = result.replaceAll("[^a-zA-Z0-9öÖäÄüÜß@\\s]", "");
            Files.delete(new File(blackAndWhiteImage).toPath());
            GroupAttachments groupAttachments = this.groupAttachmentsDAO.getAttachmenById(attachId);
            System.out.println("OCR Result is "+result);
            if (groupAttachments != null) {
                saveIndexes(result, groupAttachments.getFileName(), null, groupAttachments.getGroupId(), false, attachId);
            }
            return result;
        } catch (Exception e) {
            e.printStackTrace();

        }
        return null;
    }

Thank you.

We are Borg

2 Answers


Due to a bug in Java Image IO (fixed in Java 9), the current version of the Java Tesseract wrapper Tess4J (3.4.0 at the time of writing) does not work on Java versions below 9. To make it work on older Java versions, you can try the following fix to Tess4J's ImageIOHelper class: make a copy of the class in your project and replace getImageByteBuffer with the version below. It will then work smoothly with both files and BufferedImages.

Note: This version does not use the TIFF optimization from the original class; you can add it back if your project needs it.

public static ByteBuffer getImageByteBuffer(RenderedImage image) throws IOException {
    // A BufferedImage can be converted directly
    if (image instanceof BufferedImage) {
        return convertImageData((BufferedImage) image);
    }
    ColorModel cm = image.getColorModel();
    int width = image.getWidth();
    int height = image.getHeight();
    WritableRaster raster = cm
            .createCompatibleWritableRaster(width, height);
    boolean isAlphaPremultiplied = cm.isAlphaPremultiplied();
    Hashtable<String, Object> properties = new Hashtable<>();
    String[] keys = image.getPropertyNames();
    if (keys != null) {
        for (int i = 0; i < keys.length; i++) {
            properties.put(keys[i], image.getProperty(keys[i]));
        }
    }
    BufferedImage result = new BufferedImage(cm, raster,
            isAlphaPremultiplied, properties);
    image.copyData(raster);
    return convertImageData(result);
}
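
Since the fix is only needed below Java 9, it can help to log which case applies at startup. The following is a minimal sketch, not part of Tess4J: the class name JavaVersionCheck and both method names are hypothetical, and it assumes the standard java.specification.version system property, which reads "1.8" on Java 8 and "9", "11", and so on for later releases.

```java
public class JavaVersionCheck {

    /** Parses a java.specification.version value such as "1.8", "9" or "11" into its major version. */
    public static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Versions before Java 9 are reported as "1.x", so the major version is the second part
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    /** Returns true when the runtime is older than Java 9, where the Image IO bug is still present. */
    public static boolean needsPatchedHelper(String specVersion) {
        return majorVersion(specVersion) < 9;
    }

    public static void main(String[] args) {
        String specVersion = System.getProperty("java.specification.version");
        System.out.println("Java " + majorVersion(specVersion)
                + ", patched ImageIOHelper needed: " + needsPatchedHelper(specVersion));
    }
}
```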
ruhsuzbaykus
  • So I should replace the getImageByteBuffer method in ImageIOHelper with the code you provided? How do I call the OCR method then? Thanks. – We are Borg Sep 11 '17 at 11:07
  • Just add the fixed copy to the classpath and call Tesseract the usual way; your fixed copy will be used before the library's copy. – ruhsuzbaykus Sep 11 '17 at 11:40
  • Sorry, that didn't work; I get the same exception. I put that file in a different package and added that package under Module Settings->Modules->Dependencies in IntelliJ 13. – We are Borg Sep 11 '17 at 12:13
  • Then you are still using the old code. Confirm it by debugging and check your dependencies: your package with the fixed code should take precedence over the Tesseract package. – ruhsuzbaykus Sep 11 '17 at 12:39
  • Finally added that in libraries instead of dependencies, looks like it's working. Will add it on our server and confirm within a day. Thanks. – We are Borg Sep 13 '17 at 08:31
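
One way to confirm which copy of a class the JVM actually picked up, as suggested in the comments above, is to print the location it was loaded from. This is a minimal sketch under stated assumptions: the class name ClassOrigin and the method describe are hypothetical, the default class name net.sourceforge.tess4j.util.ImageIOHelper is taken from the stack trace in the question, and the fixed copy is assumed to keep that same fully qualified name (otherwise it is a different class and cannot stand in for the library's one).

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a marker for JDK/bootstrap classes. */
    public static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Class name assumed from the stack trace; pass another name as the first argument
        String name = args.length > 0 ? args[0] : "net.sourceforge.tess4j.util.ImageIOHelper";
        try {
            System.out.println(name + " loaded from " + describe(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

If the printed location is the Tess4J jar rather than your own classes directory or library, the original class is still the one in use.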

Try upgrading to tess4j version 3.4.1. That solved the issue for me.

Hari Bage