
I am trying to extract the text of a PDF with iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage. This fails because the PDF is badly formatted around an inline image.

I found two workarounds that make the parsing work: (A) open the PDF in Adobe Acrobat and save it as an optimized PDF, or (B) open it in Adobe Acrobat and print it back to PDF through the Adobe PDF printer.

I have 14,000 of these files and want to automate (A) or (B), but so far I have not succeeded.

For (A) I referenced the Adobe library and, in short, do something like this:

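// Uses the Acrobat COM interop types (AcroAppClass, AcroAVDocClass, CAcroPDDoc).
mApp = new AcroAppClass();
avDoc = new AcroAVDocClass();
avDoc.Open(strFilePath, "");
pdDoc = (CAcroPDDoc)avDoc.GetPDDoc();
// The first argument is the save-type flag; the rest builds "<name>_changed.pdf".
pdDoc.Save(1, strFilePath.Substring(0, strFilePath.Length - 4) + "_changed.pdf");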

But the Adobe SDK does not let me save in a different format.

For (B) I tried something like this:

Process pdfProcess = new Process();
pdfProcess.StartInfo.FileName = @"C:\Program Files (x86)\Adobe\Acrobat 11.0\Acrobat\AcroRd32.exe";
pdfProcess.StartInfo.Arguments = string.Format(@"/t", strFilePathSource, "Adobe PDF", "Adobe PDF", strFilePathTarget);
pdfProcess.Start();

This does not throw any error, but no file is produced either.
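One thing worth checking in the (B) snippet: the format string passed to string.Format is just "/t" and contains no {0}-style placeholders, so the four extra arguments are silently dropped and the process is started with only "/t". The sketch below shows that behaviour and one possible way to build a quoted argument string. The ReaderArgs class, the Build helper, and the example paths are hypothetical names used for illustration. It keeps the argument order of the original call and does not establish whether the reader accepts a target file path in the last position.

```csharp
using System;

static class ReaderArgs
{
    // Hypothetical helper: quotes each value and appends it after the /t switch,
    // keeping the same argument order as the call in the question.
    public static string Build(string source, string printer, string driver, string port)
    {
        return string.Format("/t \"{0}\" \"{1}\" \"{2}\" \"{3}\"", source, printer, driver, port);
    }

    static void Main()
    {
        // A format string without placeholders ignores the extra arguments.
        string original = string.Format(@"/t", @"C:\in.pdf", "Adobe PDF", "Adobe PDF", @"C:\out.pdf");
        Console.WriteLine(original);

        // With placeholders, every value ends up in the argument string.
        Console.WriteLine(Build(@"C:\in.pdf", "Adobe PDF", "Adobe PDF", @"C:\out.pdf"));
    }
}
```

The result of Build could then be assigned to pdfProcess.StartInfo.Arguments in place of the original string.Format call.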

Bruno Lowagie
Graffl
  • It seems like a huge work to do. It would take less time and money if only you shared the PDFs with iText Software so that we can work on a fix. Note that we sign an NDA if you're a customer. – Bruno Lowagie Jun 04 '14 at 14:42
  • According to the [docs](http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/iac_api_reference.pdf) on page 84 the first parameter to `Save` is a logical `OR` of a couple of options. Looks like you want `PDSaveLinearized` which appears to be `0x04` along with `PDSaveFull` which is `0x01` so basically `5`. – Chris Haas Jun 04 '14 at 19:18
  • If this is a one time thing, you can open Acrobat and optimize all the files at one time using Acrobat's action wizard. Save and export, save. – Phil Sattele Jun 05 '14 at 15:17
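The flag arithmetic from Chris Haas's comment can be sketched as follows. The SaveFlags enum is a local, hypothetical mirror of the two values quoted in that comment (PDSaveFull as 0x01, PDSaveLinearized as 0x04), not the SDK's own definition, and whether a linearized save is enough to fix the files is an assumption that would need testing.

```csharp
using System;

// Local mirror of the two values quoted in the comment; an assumption,
// not a type taken from the Acrobat SDK.
[Flags]
enum SaveFlags
{
    PDSaveFull = 0x01,
    PDSaveLinearized = 0x04
}

static class SaveFlagDemo
{
    // Combines the two flags with a bitwise OR, as the comment describes.
    public static int Combined()
    {
        return (int)(SaveFlags.PDSaveFull | SaveFlags.PDSaveLinearized);
    }

    static void Main()
    {
        Console.WriteLine(Combined()); // prints 5

        // With the Acrobat interop objects from the question, the value
        // would replace the 1 in the existing call, for example:
        // pdDoc.Save(5, targetPath);
    }
}
```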

0 Answers