I am running OCR on some images with Tesseract and then trying to merge the generated PDFs using the PoDoFo C++ library.

I have tried two approaches (the first one is what I need):

  1. Using the Tesseract C++ API and the PoDoFo C++ library

My code looks roughly like this:

For the OCR part (run once for 001.jpg and once for 002.jpg):

    const char* input_image = "001.jpg";
    const char* output_base = "001";
    const char* datapath = "/home/test/Desktop/Example2";

    int timeout_ms = 5000;
    const char* retry_config = nullptr;
    bool textonly = false;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(datapath, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    tesseract::TessPDFRenderer *renderer = new tesseract::TessPDFRenderer(
        output_base, api->GetDatapath(), textonly);

    bool succeed = api->ProcessPages(input_image, retry_config, timeout_ms, renderer);
    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return EXIT_FAILURE;
    }

    api->End();
    return EXIT_SUCCESS;

For the PDF merging part:

    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <podofo/podofo.h>

    void mergePDF(std::vector<char*> inputfiles, char* outputfile) {
        try {
            /* Reading first PDF */
            fprintf(stdout, "Reading file: %s\n", inputfiles[0]);
            PoDoFo::PdfMemDocument doc1;
            doc1.Load(inputfiles[0]);

            /* Reading second PDF */
            fprintf(stdout, "Reading file: %s\n", inputfiles[1]);
            PoDoFo::PdfMemDocument doc2;
            doc2.Load(inputfiles[1]);

            /* Appending doc2 to doc1 */
            doc1.Append(doc2);

            fprintf(stdout, "Writing files to %s\n ", outputfile);
            doc1.Write(outputfile);
        }
        catch (const PoDoFo::PdfError&) {
            /* Rethrow the original exception so main() can report it */
            throw;
        }
    }

    int main(int argc, char* argv[]) {
        /* Two input PDFs and one output path are required */
        if (argc < 4) {
            printHelp();  /* usage helper, defined elsewhere (not shown) */
            exit(EXIT_FAILURE);
        }

        PoDoFo::PdfError::EnableDebug(false);
        std::vector<char*> inputfiles;
        char* outputfile;

        inputfiles.emplace_back(argv[1]);
        inputfiles.emplace_back(argv[2]);
        outputfile = argv[3];
        try {
            mergePDF(inputfiles, outputfile);
        }
        catch (const PoDoFo::PdfError &e) {
            fprintf(stderr, "Error %i occured!\n", e.GetError());
            e.PrintErrorMsg();
            return e.GetError();
        }
        return EXIT_SUCCESS;
    }

Output:

Warning: Invalid resolution 0 dpi. Using 70 instead.
Warning: Invalid resolution 0 dpi. Using 70 instead.

Reading file: /home/test/Desktop/Example2/001.pdf
Error 17 occured!


PoDoFo encountered an error. Error: 17 ePdfError_NoEOFToken
    Error Description: No EOF Marker was found in the PDF file.
    Callstack:
    #0 Error Source: /home/test/podofo/src/podofo/doc/PdfMemDocument.cpp:263
        Information: Handler fixes issue #49
    #1 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:272
        Information: Unable to load objects from file.
    #2 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:310
        Information: EOF marker could not be found.
    #3 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:1528

  2. Using the Tesseract command line utility and the PoDoFo C++ library

For the OCR part, I use the Tesseract CLI tool as follows:

    tesseract 001.jpg 001 pdf
    tesseract 002.jpg 002 pdf

For the PDF merging part, the code is the same as in approach 1 above.

Output:

Reading file: /home/test/Desktop/Example2/001.pdf
Reading file: /home/test/Desktop/Example2/002.pdf
Fixing references in 13 0 R by 12
Fixing references in 14 0 R by 12
Fixing references in 15 0 R by 12
Fixing references in 16 0 R by 12
Fixing references in 17 0 R by 12
Fixing references in 18 0 R by 12
Fixing references in 19 0 R by 12
Fixing references in 20 0 R by 12
Fixing references in 21 0 R by 12
Fixing references in 22 0 R by 12
Fixing references in 23 0 R by 12
Fixing references in 24 0 R by 12
Reading file: /home/test/Desktop/Example2/output.pdf

Why do I get the EOF marker error with PDFs generated through the Tesseract C++ API, but not with PDFs generated by the Tesseract CLI tool?

Am I missing something in the OCR code in approach 1 above?
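
One way to narrow this down is to check, right before PoDoFo loads the file, whether 001.pdf on disk already ends with the %%EOF marker. Below is a minimal sketch using only the standard library; hasPdfEofMarker is a hypothetical helper name, and the 1024-byte tail window is an assumption, not necessarily how PoDoFo itself searches for the marker.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: returns true if "%%EOF" occurs within the last
// 1024 bytes of the file (the window size is an assumption).
bool hasPdfEofMarker(const std::string& path) {
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return false;  // file missing or unreadable
    }
    const std::streamoff size = in.tellg();
    if (size <= 0) {
        return false;  // empty file cannot contain the marker
    }

    // Read only the tail of the file.
    const std::streamoff tailLen = std::min<std::streamoff>(size, 1024);
    in.seekg(size - tailLen);
    std::string tail(static_cast<std::size_t>(tailLen), '\0');
    in.read(&tail[0], tailLen);
    if (in.gcount() != tailLen) {
        return false;  // short read
    }

    return tail.find("%%EOF") != std::string::npos;
}
```

If a check like this fails right after ProcessPages() returns but succeeds on the same file once the OCR program has exited, the PDF was probably still buffered and not fully written when it was read.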

dashthird
  • If you are using PoDoFo only to merge the generated PDFs, try generating a single PDF from multiple images instead. Call api->ProcessPages() multiple times with a different input image each time and the same renderer, and see whether the output is concatenated into a single PDF. As for the Tesseract API vs. the CLI tool, they behave differently here and there with default settings. Try ProcessPages() multiple times, or see if a comma-separated list of image names works. – saumitra mallick Mar 24 '21 at 08:31
  • @saumitramallick Yes, your method works! – dashthird Apr 04 '21 at 08:12