I use Tesseract with the tessdata_best models. Version information:
tesseract 5.3.0-rc1-2-gf2519
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.4) : libpng 1.6.39 : libtiff 4.4.0 : zlib 1.2.11 : libwebp 1.2.4 : libopenjp2 2.5.0
I am trying out OCR with Tesseract. It produces output, but not accurately. When I ran it on a passport, the result had extra spaces inserted inside words, which should not be there. On some documents it also read the letter I as the digit 1. Are there settings that would improve accuracy? Here is one example:
Expected output
P<CZESPECIMEN<<VZOR<<<<<<<<<<<<<<<<<<<<<<<<<< 99009054<4CZE6906229F16072996956220612<<<<74
Actual output
P<C ZESPE C I MEN<<VZOR<<<<<<<<-<<<<<<<<<<<<<<<<< 99009054<4C Z E6906229F16072996956220612<<<<74
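The two kinds of error visible here (inserted spaces, and a stray character such as `-`) can be detected mechanically, which may help when comparing settings. The following is a minimal sketch and not part of my program: the class `MrzCheck` and its method names are hypothetical. It assumes the text is a passport machine-readable zone, which uses only A-Z, 0-9 and the filler `<`, never spaces, and whose numeric fields carry an ICAO 9303 check digit (weights 7, 3, 1, sum modulo 10). A letter/digit confusion such as I versus 1 inside a checked field will usually make the check digit fail.

```java
import java.util.ArrayList;
import java.util.List;

public class MrzCheck {

    /** Removes every whitespace character; a valid MRZ contains none. */
    static String stripWhitespace(String ocrText) {
        return ocrText.replaceAll("\\s+", "");
    }

    /** Returns the indexes of characters outside the MRZ alphabet (A-Z, 0-9, '<'). */
    static List<Integer> invalidPositions(String mrz) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < mrz.length(); i++) {
            char c = mrz.charAt(i);
            boolean valid = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '<';
            if (!valid) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Computes the ICAO 9303 check digit of one field (weights 7, 3, 1 repeating). */
    static int checkDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value;
            if (c >= '0' && c <= '9') {
                value = c - '0';          // digits count as themselves
            } else if (c >= 'A' && c <= 'Z') {
                value = c - 'A' + 10;     // A=10 ... Z=35
            } else if (c == '<') {
                value = 0;                // filler counts as zero
            } else {
                throw new IllegalArgumentException("Not an MRZ character: " + c);
            }
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    public static void main(String[] args) {
        // Shortened fragment of the actual output above, for illustration only.
        String cleaned = stripWhitespace("P<C ZESPE C I MEN<<VZOR<<-<<");
        System.out.println(cleaned);
        System.out.println("Invalid positions: " + invalidPositions(cleaned));
        // Document number field from the second line, followed by its printed check digit 4.
        System.out.println("Check digit: " + checkDigit("99009054<"));
    }
}
```

Stripping whitespace only hides the spacing problem after the fact; it does not explain why Tesseract inserts the spaces in the first place.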
Code
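public static void main(String[] args) throws IOException {
    // Load the OpenCV native library and read the source image.
    nu.pattern.OpenCV.loadLocally();
    Mat matImage = Imgcodecs.imread("src/main/resources/passport4.jpg");

    // Convert the Mat to a BufferedImage by encoding it in memory.
    MatOfByte matOfByte = new MatOfByte();
    Imgcodecs.imencode(".jpg", matImage, matOfByte);
    byte[] byteArray = matOfByte.toArray();
    BufferedImage bufferedImage = ImageIO.read(new ByteArrayInputStream(byteArray));

    // Binarization.returnBuffered is a helper that is not shown here.
    Binarization.returnBuffered(bufferedImage);

    // The engine mode is an argument to Init, not a variable:
    // "--oem" is a command-line flag, so SetVariable("--oem", ...) has no effect.
    // 1 = OEM_LSTM_ONLY.
    TessBaseAPI api = new TessBaseAPI();
    if (api.Init(null, "eng", 1) != 0) {
        System.err.println("Could not initialize tesseract.");
        System.exit(1);
    }

    // Configure after Init; SetVariable is documented as valid only after Init.
    api.SetPageSegMode(13); // 13 = PSM_RAW_LINE: treat the image as a single text line
    api.SetVariable("preserve_interword_spaces", "1");
    api.SetVariable("tessedit_timing_debug", "1");

    // Read the image to recognize: first argument, or saved.png by default.
    PIX image = pixRead(args.length > 0 ? args[0] : "saved.png");
    api.SetImage(image);

    // Run OCR, then print and save the result (writeToFile is a helper not shown here).
    BytePointer outText = api.GetUTF8Text();
    String text = outText.getString();
    System.out.println("OCR output:\n" + text);
    writeToFile.writeUsingFileWriter(text);

    // Release native resources.
    api.End();
    outText.deallocate();
    pixDestroy(image);
}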