I am running OCR on some images with Tesseract and then trying to merge the generated PDFs using the PoDoFo C++ library.
I have tried two approaches (the first one is what I need):
- Using the Tesseract C++ API and the PoDoFo C++ library
My code looks roughly like this:
For the OCR part (run for 001.jpg and 002.jpg):
    const char* input_image = "001.jpg";
    const char* output_base = "001";
    const char* datapath = "/home/test/Desktop/Example2";
    int timeout_ms = 5000;
    const char* retry_config = nullptr;
    bool textonly = false;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(datapath, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    tesseract::TessPDFRenderer *renderer = new tesseract::TessPDFRenderer(
        output_base, api->GetDatapath(), textonly);

    bool succeed = api->ProcessPages(input_image, retry_config, timeout_ms, renderer);
    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return EXIT_FAILURE;
    }

    api->End();
    return EXIT_SUCCESS;
For the PDF merging part:
    void mergePDF(std::vector<char*> inputfiles, char* outputfile) {
        try {
            /* Reading first PDF */
            fprintf(stdout, "Reading file: %s\n", inputfiles[0]);
            PoDoFo::PdfMemDocument doc1;
            doc1.Load(inputfiles[0]);

            /* Reading second PDF */
            fprintf(stdout, "Reading file: %s\n", inputfiles[1]);
            PoDoFo::PdfMemDocument doc2;
            doc2.Load(inputfiles[1]);

            /* Appending doc2 to doc1 */
            doc1.Append(doc2);

            fprintf(stdout, "Writing files to %s\n ", outputfile);
            doc1.Write(outputfile);
        }
        catch (const PoDoFo::PdfError& e) {
            /* Rethrow the original exception object to the caller */
            throw;
        }
    }
    int main(int argc, char* argv[]) {
        /* Two input PDFs and one output path are required */
        if (argc < 4) {
            printHelp();
            exit(EXIT_FAILURE);
        }

        PoDoFo::PdfError::EnableDebug(false);

        std::vector<char*> inputfiles;
        char* outputfile;
        inputfiles.emplace_back(argv[1]);
        inputfiles.emplace_back(argv[2]);
        outputfile = argv[3];

        try {
            mergePDF(inputfiles, outputfile);
        }
        catch (const PoDoFo::PdfError &e) {
            fprintf(stderr, "Error %i occured!\n", e.GetError());
            e.PrintErrorMsg();
            return e.GetError();
        }

        exit(EXIT_SUCCESS);
    }
Output:
    Warning: Invalid resolution 0 dpi. Using 70 instead.
    Warning: Invalid resolution 0 dpi. Using 70 instead.
    Reading file: /home/test/Desktop/Example2/001.pdf
    Error 17 occured!
    PoDoFo encountered an error. Error: 17 ePdfError_NoEOFToken
    Error Description: No EOF Marker was found in the PDF file.
    Callstack:
    #0 Error Source: /home/test/podofo/src/podofo/doc/PdfMemDocument.cpp:263
       Information: Handler fixes issue #49
    #1 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:272
       Information: Unable to load objects from file.
    #2 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:310
       Information: EOF marker could not be found.
    #3 Error Source: /home/test/podofo/src/podofo/base/PdfParser.cpp:1528
- Using the Tesseract command-line utility and the PoDoFo C++ library
For the OCR part, I use the Tesseract CLI tool as follows:
    tesseract 001.jpg 001 pdf
    tesseract 002.jpg 002 pdf
For the PDF merging part, the code is the same as in the first approach above.
Output:
    Reading file: /home/test/Desktop/Example2/001.pdf
    Reading file: /home/test/Desktop/Example2/002.pdf
    Fixing references in 13 0 R by 12
    Fixing references in 14 0 R by 12
    Fixing references in 15 0 R by 12
    Fixing references in 16 0 R by 12
    Fixing references in 17 0 R by 12
    Fixing references in 18 0 R by 12
    Fixing references in 19 0 R by 12
    Fixing references in 20 0 R by 12
    Fixing references in 21 0 R by 12
    Fixing references in 22 0 R by 12
    Fixing references in 23 0 R by 12
    Fixing references in 24 0 R by 12
    Reading file: /home/test/Desktop/Example2/output.pdf
I wonder why I get the missing EOF marker error when the PDFs are generated through the Tesseract C++ API, but not when they are generated with the Tesseract CLI tool.
Am I missing something in the OCR code in the first approach above?
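To narrow this down, I could check whether the file on disk really lacks the `%%EOF` trailer at the moment PoDoFo opens it. The following is only a sketch of such a check: `hasPdfEofMarker` is a helper name I made up (it is not part of PoDoFo or Tesseract), and the 1024-byte search window is an assumption based on the common convention of looking for the marker near the end of the file.

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// Hypothetical diagnostic helper (not a PoDoFo or Tesseract API):
// returns true if the last `window` bytes of the file contain "%%EOF".
bool hasPdfEofMarker(const std::string& path, std::streamoff window = 1024) {
    // Open in binary mode, positioned at the end so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return false;  // file missing or unreadable
    }

    const std::streamoff size = in.tellg();
    if (size <= 0) {
        return false;  // empty file: nothing has reached the disk yet
    }

    // Read only the tail of the file, where the trailer is expected.
    const std::streamoff count = std::min(size, window);
    in.seekg(size - count);
    std::string tail(static_cast<std::size_t>(count), '\0');
    in.read(&tail[0], count);
    tail.resize(static_cast<std::size_t>(in.gcount()));

    return tail.find("%%EOF") != std::string::npos;
}
```

Calling `hasPdfEofMarker("/home/test/Desktop/Example2/001.pdf")` right before `mergePDF` in both approaches would show whether the API-generated file is incomplete on disk at that point, or whether the problem lies elsewhere.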