
Is there a way to tell Tess4J to OCR only a certain number of pages or characters?

I will potentially be working with 200+ page PDFs, but I really only want to OCR the first page, if that!

As far as I understand, the common sample

package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {

    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

would attempt to OCR the entire 200+ page document into a single String.

For my particular case, that is way more than I need it to do, and I'm worried it could take a very long time if I let it do all 200+ pages and then just substring the first 500 or so.

Don Cheadle

1 Answer


The library has a PdfUtilities class that can extract certain pages of a PDF.

nguyenq
  • I see the `public static void splitPdf(java.lang.String inputPdfFile, java.lang.String outputPdfFile, java.lang.String firstPage, java.lang.String lastPage)` -- the inputPdfFile is a String argument (not a java.io.File?), and the page arguments are Strings as well. That confuses me. Should the inputPdfFile arg be something like "C:/users/mmcrae/MyDoc.pdf"? And should page args be like "1" or "23"? (I would expect those to be `int` arguments) – Don Cheadle Oct 23 '14 at 13:36
  • Does my confusion above make sense? – Don Cheadle Oct 23 '14 at 21:16
  • Yes, they are string arguments passed to the Ghostscript command interpreter. The client program should perform input validation before calling the method, so be sure to use numeric strings for the page arguments. Thanks for pointing that out. The method could be deprecated and replaced with one with correct types. – nguyenq Oct 23 '14 at 22:26
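
The input validation recommended in the last comment could be sketched roughly as below. `PageArgs`, `requirePage`, and `range` are hypothetical helpers, not part of Tess4J; they only make sure that the page arguments are positive numeric strings before they reach the `splitPdf` signature quoted above. The commented-out calls assume Tess4J is on the classpath, and `firstPage.pdf` is an illustrative output name.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tess4J): validates page arguments
// before they are handed to PdfUtilities.splitPdf as strings.
final class PageArgs {

    // Digits only, no sign, no leading zero: accepts "1" and "23", rejects "0", "-1", "1a".
    private static final Pattern PAGE_NUMBER = Pattern.compile("[1-9][0-9]*");

    private PageArgs() {}

    /** Returns the argument unchanged if it is a positive page number, else throws. */
    static String requirePage(String arg) {
        if (arg == null || !PAGE_NUMBER.matcher(arg).matches()) {
            throw new IllegalArgumentException("Not a valid page number: " + arg);
        }
        return arg;
    }

    /** Validates a first/last pair and returns both as numeric strings. */
    static String[] range(int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException(
                "Invalid page range: " + firstPage + "-" + lastPage);
        }
        return new String[] { Integer.toString(firstPage), Integer.toString(lastPage) };
    }

    public static void main(String[] args) {
        String[] pages = range(1, 1); // first page only
        System.out.println(pages[0] + " " + pages[1]);
        // With Tess4J available, the validated strings would then be passed on
        // (assumption: the String-based signature quoted in the comments above):
        // PdfUtilities.splitPdf("C:/users/mmcrae/MyDoc.pdf", "firstPage.pdf", pages[0], pages[1]);
        // String text = instance.doOCR(new File("firstPage.pdf"));
    }
}
```

Taking `int` arguments in `range` and converting them to strings at the last moment keeps the numeric check in one place, which is close to what a correctly typed replacement method would do.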