
Is there a definitive way to check whether an existing PDF has only its header and footer populated?

I run a console app that merges existing PDFs together into a single PDF (FileA.pdf, FileB.pdf and FileC.pdf become FileABC.pdf).

The caveat is that I need to check whether each PDF is empty or populated. If a file is empty, it is excluded from the merge (FileB.pdf is empty, so the merged file is FileAC.pdf). However, an empty file still has a header and footer, just no body content, and I need to account for that.

Currently, I manually record the byte size of each of the different empty PDFs and use System.IO.FileInfo to check whether a file is larger than the recorded empty size.

While this works 99% of the time, sometimes there is an anomaly: an empty PDF is 1 to 2 bytes larger than the recorded size, which allows an empty file to slip through.
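For reference, the size check described above looks roughly like the sketch below. The class and method names are hypothetical, and the recorded empty size is passed in as a parameter rather than looked up from my real list.

```csharp
using System.IO;

public static class SizeCheck
{
    // Returns true when the file on disk is strictly larger than the
    // byte size recorded for the matching empty PDF.
    // 'recordedEmptySize' is the manually gathered value (hypothetical parameter).
    public static bool IsLargerThanEmpty(string path, long recordedEmptySize)
    {
        long actualSize = new FileInfo(path).Length;
        return actualSize > recordedEmptySize;
    }
}
```

Because the comparison is strict, an empty PDF that comes out even one byte larger than the recorded size is treated as populated, which is exactly the anomaly I am seeing.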

  • It is possible, but not super easy. Your best bet would be to use a third party code library such as iTextSharp to parse the PDF and give you in-code objects that represent the elements within the PDF (tables, blocks of text, images, etc.). At that point you would need to write custom code to loop through the element objects and apply your logic. For example, if there are zero elements between the header element and footer element, or if there ARE elements but they're all text blocks containing only whitespace, then you know it's an "empty" PDF. – Joe Irby Jul 06 '18 at 21:33
  • Indeed, relying on FileInfo alone won't help; you'll need a PDF library. But don't expect to easily extract tables from the PDF. Usually a table in a PDF is merely a collection of text pieces and possibly some lines or background coloration rectangles... – mkl Jul 07 '18 at 06:29
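The check suggested in the comments can be sketched as follows. This is only the decision logic, under stated assumptions: the `TextChunk` type, the `HasBodyContent` method, and the two band limits are hypothetical names introduced for illustration, and the step that fills the chunk list (extracting text pieces and their vertical positions with a PDF library such as iTextSharp) is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one extracted piece of text.
// A PDF library would be responsible for producing these.
public sealed class TextChunk
{
    public string Text { get; }
    public float Y { get; } // vertical position in points, measured from the bottom of the page

    public TextChunk(string text, float y)
    {
        Text = text;
        Y = y;
    }
}

public static class EmptyPdfCheck
{
    // Returns true when at least one non-whitespace chunk lies strictly
    // between the top of the footer band and the bottom of the header band.
    // 'footerTop' and 'headerBottom' are assumed to be known per template.
    public static bool HasBodyContent(
        IEnumerable<TextChunk> chunks, float footerTop, float headerBottom)
    {
        return chunks.Any(c =>
            c.Y > footerTop &&
            c.Y < headerBottom &&
            !string.IsNullOrWhiteSpace(c.Text));
    }
}
```

A file would then be skipped from the merge when `HasBodyContent` returns false for every page. Note that this looks at text only; as mentioned above, tables are often just text pieces plus lines or rectangles, so a body made up solely of images or vector graphics would need a separate check.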
