Tesseract adds unnecessary space in words, and interprets I as 1

Question

I am using the following Tesseract build with the tessdata_best models:

tesseract 5.3.0-rc1-2-gf2519 leptonica-1.82.0 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.4) : libpng 1.6.39 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0

I am trying out OCR with Tesseract. It produces output, but not accurately. When I ran it on a passport, it inserted spurious spaces inside words, and on some documents it read the letter I as the digit 1. Are there settings that would improve its accuracy? Here is one example:

Expected output

P<CZESPECIMEN<<VZOR<<<<<<<<<<<<<<<<<<<<<<<<<< 99009054<4CZE6906229F16072996956220612<<<<74

Actual output

P<C ZESPE C I MEN<<VZOR<<<<<<<<-<<<<<<<<<<<<<<<<< 99009054<4C Z E6906229F16072996956220612<<<<74

Code

public static void main(String[] args) throws IOException {
        nu.pattern.OpenCV.loadLocally();

        // Load the passport scan with OpenCV and convert it to a BufferedImage.
        Mat matImage = Imgcodecs.imread("src/main/resources/passport4.jpg");
        MatOfByte matOfByte = new MatOfByte();
        Imgcodecs.imencode(".jpg", matImage, matOfByte);
        byte[] byteArray = matOfByte.toArray();
        BufferedImage bufferedImage = ImageIO.read(new ByteArrayInputStream(byteArray));

        // Binarization is my own helper class (not shown).
        Binarization.returnBuffered(bufferedImage);

        TessBaseAPI api = new TessBaseAPI();
        // The engine mode is an argument of Init(); "--oem" is a command-line flag,
        // not a variable, so SetVariable("--oem", ...) has no effect. 1 = OEM_LSTM_ONLY.
        if (api.Init(null, "eng", 1) != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }
        // Per the Tesseract API notes, SetVariable() must be called after Init().
        api.SetPageSegMode(13); // 13 = PSM_RAW_LINE (treat the image as a single text line)
        api.SetVariable("preserve_interword_spaces", "1");
        api.SetVariable("tessedit_timing_debug", "1");

        // Run OCR on the image given as the first argument, or on saved.png by default.
        PIX image = pixRead(args.length > 0 ? args[0] : "saved.png");
        api.SetImage(image);
        BytePointer outText = api.GetUTF8Text();
        String text = outText.getString();
        System.out.println("OCR output:\n" + text);
        // writeToFile is my own helper class (not shown).
        writeToFile.writeUsingFileWriter(text);

        // Release native resources.
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
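
Independently of any Tesseract settings, the two symptoms in the example can be partly masked by cleaning up the machine-readable zone after OCR. The following is only a minimal sketch of that idea, not a fix for the recognition itself. The class `MrzCleanup` and its method names are hypothetical, and it assumes a TD3 passport MRZ as in the expected output above: lines of 44 characters drawn from A-Z, 0-9 and `<`, with the nationality at positions 11-13 and the date fields (each followed by a check digit) at positions 14-20 and 22-28 of the second line.

```java
import java.util.Locale;

public class MrzCleanup {

    /** Removes every character that cannot occur in an MRZ line (only A-Z, 0-9 and '<' are allowed). */
    static String stripNoise(String ocrLine) {
        return ocrLine.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9<]", "");
    }

    /** In a digits-only field, maps look-alike letters to digits. */
    static String toDigits(String field) {
        return field.replace('I', '1').replace('O', '0');
    }

    /** In a letters-only field, maps look-alike digits to letters. */
    static String toLetters(String field) {
        return field.replace('1', 'I').replace('0', 'O');
    }

    /**
     * Applies field-aware corrections to the second line of a TD3 MRZ.
     * Lines that are not exactly 44 characters long are returned unchanged,
     * because the field offsets below would not be trustworthy.
     */
    static String fixTd3Line2(String line) {
        if (line.length() != 44) {
            return line;
        }
        StringBuilder sb = new StringBuilder(line);
        sb.replace(10, 13, toLetters(line.substring(10, 13))); // nationality
        sb.replace(13, 20, toDigits(line.substring(13, 20)));  // date of birth + check digit
        sb.replace(21, 28, toDigits(line.substring(21, 28)));  // date of expiry + check digit
        return sb.toString();
    }

    public static void main(String[] args) {
        // Second MRZ line as returned in the question, with the spurious spaces.
        String raw = "99009054<4C Z E6906229F16072996956220612<<<<74";
        System.out.println(fixTd3Line2(stripNoise(raw)));
    }
}
```

The cleanup has to run on each MRZ line separately, since removing all whitespace from the whole OCR result would also merge the two lines. It cannot recover characters that were dropped or invented, so validating the check digits afterwards is still advisable.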
