Most PDFs contain lots of binary-looking parts in between some ASCII. But I also remember seeing PDFs in which such binary parts were largely absent, so one could open them in a text editor to study their structure.

Is there a trick, tool, or command that will convert binary PDF parts to ASCII/ANSI? (Preferably "free as in beer" or even "free as in liberty")

jww

1 Answer

[Updated 2014-10-15]

Using Ghostscript

Ghostscript has a small utility program written in PostScript in its source code repository. It's called pdfinflt.ps. If you are lucky, it may already slumber in a 'toolbin' subdirectory of your Ghostscript installation location. Otherwise, get it here:

Now run it together with your targeted input PDF through the Ghostscript interpreter:

gswin32c.exe -- c:/path/to/pdfinflt.ps your-input.pdf deflated-output.pdf

pdfinflt.ps will (try to) expand all 'streams' contained in the PDF which use the following compression filters/methods: /FlateDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode.

It will not attempt to remove /RunLengthDecode, /CCITTFaxDecode, /DCTDecode, /JBIG2Decode and /JPXDecode. (Compressed/binary fonts will also pass unchanged into the output PDF.)
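
To make concrete what "expanding a stream" means, here is a minimal Python sketch that undoes three of the filters named above using only the standard library. It is not how pdfinflt.ps works internally; the function name decode_stream is hypothetical, and /LZWDecode is left out because Python's standard library has no LZW decoder.

```python
import base64
import zlib


def decode_stream(data: bytes, filter_name: str) -> bytes:
    """Undo a single PDF stream filter (simplified sketch, no filter chains)."""
    if filter_name == "/FlateDecode":
        # Flate streams are ordinary zlib data.
        return zlib.decompress(data)
    if filter_name == "/ASCIIHexDecode":
        # Hex digits up to the '>' end marker; whitespace is ignored.
        digits = b"".join(data.split(b">", 1)[0].split())
        if len(digits) % 2:
            digits += b"0"  # an odd final digit is treated as if followed by 0
        return bytes.fromhex(digits.decode("ascii"))
    if filter_name == "/ASCII85Decode":
        # PDF terminates ASCII85 data with '~>'; strip it before decoding.
        body = data.rsplit(b"~>", 1)[0]
        return base64.a85decode(body)
    raise ValueError(f"filter not handled in this sketch: {filter_name}")


if __name__ == "__main__":
    content = b"BT /F1 12 Tf (Hello) Tj ET"
    packed = zlib.compress(content)
    # The compressed form is binary; the decoded form is readable again.
    print(decode_stream(packed, "/FlateDecode"))
```

A real stream may chain several filters (for example /ASCII85Decode followed by /FlateDecode), in which case they would have to be undone in the listed order.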

If you are in an adventurous mood, you can uncomment the lines in the utility that are commented out for /RunLengthDecode, /DCTDecode and /CCITTFaxDecode, and see if it still works...


Using qpdf

Another useful tool to transform a PDF into an internal format that enables text editor access is qpdf. It is a "command-line program that does structural, content-preserving transformations on PDF files".

Example usage:

 qpdf                                  \
   --qdf                               \
   --object-streams=disable            \
     input-with-compressed-objects.pdf \
     output-with-expanded-objects.pdf
  1. The QDF mode enabled by the --qdf switch organizes and reorders the objects neatly. It adds comments to track the original object IDs and page content streams. All object dictionaries are written in a "normalized" standard format for easier parsing.

  2. The --object-streams=disable switch extracts the individual objects that are compressed into another object's stream data (and would otherwise not be recognizable).
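
Once the objects are written out individually, their "N G obj" headers can be found with a plain text search. The following Python sketch lists them; it assumes the headers start at the beginning of a line (as they do in expanded output) and that no stream data happens to contain a look-alike line. The name list_objects is hypothetical.

```python
import re

# An indirect object opens with "<number> <generation> obj" on its own line.
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def list_objects(pdf_bytes: bytes) -> list[tuple[int, int, int]]:
    """Return (object number, generation, byte offset) for each header found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start())
        for m in OBJ_HEADER.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    sample = (
        b"%PDF-1.3\n"
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    )
    for number, generation, offset in list_objects(sample):
        print(f"object {number} {generation} at byte {offset}")
    # For a real file, read it in binary mode first, e.g.
    # pathlib.Path("output-with-expanded-objects.pdf").read_bytes()
```

References such as "2 0 R" are not matched, because they lack the obj keyword.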


Using mutool

Artifex, the creators of Ghostscript, offer another tool under a Free and Open Source Software license: MuPDF.

MuPDF comes with a command line tool, mutool, which also can expand compressed PDF object streams:

 mutool        \
    clean      \
   -d          \
   -a          \
    input.pdf  \
    output.pdf \
    4,7,8,9
  1. clean: re-writes the PDF;
  2. -d: de-compresses all streams;
  3. -a: ASCIIhex encodes all binary streams;
  4. 4,7,8,9: selects pages 4, 7, 8 and 9 for inclusion in output.pdf.
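
To illustrate what ASCIIHex encoding does to a binary stream, here is a small Python sketch. It shows the general /ASCIIHexDecode data format (hex digits ended by a ">" marker); the line width and lowercase digits are assumptions of this sketch, and mutool's actual output may be laid out differently. The name ascii_hex_encode is hypothetical.

```python
import binascii


def ascii_hex_encode(data: bytes, width: int = 64) -> bytes:
    """Encode bytes as ASCIIHex stream data: wrapped hex digits ending in '>'."""
    digits = binascii.hexlify(data)
    # Wrap the digits so the result stays comfortable in a text editor.
    lines = [digits[i:i + width] for i in range(0, len(digits), width)]
    return b"\n".join(lines) + b">"


if __name__ == "__main__":
    binary = bytes([0x00, 0xFF, 0x89]) + b"PNG"
    print(ascii_hex_encode(binary).decode("ascii"))
```

The encoded form is twice as large as the original bytes, but every byte of it is plain ASCII.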

Using pdftk

Last, here is how to use the pdftk tool to uncompress a PDF's object streams:

pdftk your-input.pdf cat output uncompressed-output.pdf uncompress

Note the final uncompress word in the command line.


Pick your favorite

All of the tools above are available for Linux, Mac OSX, Unix and Windows.

My own favorite is qpdf for most practical cases.

However, you should run your own experiments and compare the (different) output of each of the suggested tools. Then make your own pick.
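
One rough way to compare the outputs is to measure how much of each file is printable ASCII. The Python sketch below does that; the metric and the name printable_ratio are my own illustration, not something any of the tools report.

```python
import string

# Byte values of printable ASCII characters, including common whitespace.
PRINTABLE = frozenset(string.printable.encode("ascii"))


def printable_ratio(data: bytes) -> float:
    """Return the fraction of bytes that are printable ASCII (1.0 if empty)."""
    if not data:
        return 1.0
    return sum(byte in PRINTABLE for byte in data) / len(data)


if __name__ == "__main__":
    print(printable_ratio(b"1 0 obj\n<< /Length 12 >>\n"))
    print(printable_ratio(b"\x00\x01ab"))
    # For real files, read each tool's output in binary mode, e.g.
    # printable_ratio(pathlib.Path("uncompressed-output.pdf").read_bytes())
```

A ratio close to 1.0 suggests the file will be comfortable to read in a text editor; embedded fonts and images that were left compressed will pull the number down.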

Kurt Pfeifle
  • Thank you ... this works for me at least in parts. I can now better poke at all these *obj 1 0 R* parts... trying to understand that stuff better. –  Aug 14 '10 at 22:17
  • Wish I could recommend this answer ten times. My favorite application is to fix OCR errors in the hidden (copy-pastable) text behind scanned documents: convert to ascii with pdftk uncompress, find and fix the typos with the editor, then (if desired) compress again. – Silvio Levy Jul 15 '17 at 07:27
  • @SilvioLevy: Well, for a start you could upvote it at least *once*. And since multiple votes for the same answer are not allowed, you could search ***[my complete list of PDF-related answers](https://stackoverflow.com/search?q=user%3A359307+%5Bpdf%5D)*** and see if you find 9 other ones you deem worthy of upvoting :-) – Kurt Pfeifle Jul 15 '17 at 09:08
  • @KurtPeifle Why do you think I have not upvoted it? Thanks for your list of answers; I will definitely look at it when I have the leisure. Thank you for your contributions. – Silvio Levy Jul 15 '17 at 09:13
  • @SilvioLevy: I just noticed that around the time of your comment there was no upvote for this answer. In fact, there had been no upvote for this answer for a few months. Sometimes, newbie Stack Exchange users do not (yet) know how its sites work... No bad feelings. – Kurt Pfeifle Jul 15 '17 at 09:25
  • That's because I tried to upvote it and was told that I couldn't upvote twice -- apparently I'd seen this answer in the past and liked it. Hence my comment. (As a matter of fact I just looked it up in my browser history: it was on March 27 around 9PM pacific time (March 28 if you're in Europe). Maybe you see an upvote then? ;-) – Silvio Levy Jul 15 '17 at 09:36
  • none at all. I was just curious whether you'd find it. – Silvio Levy Jul 15 '17 at 09:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/149284/discussion-between-silvio-levy-and-kurt-pfeifle). – Silvio Levy Jul 15 '17 at 09:43
  • Thank you for this - I'm on mac and was able to install qpdf with homebrew. Thanks! – Peter Cullen Mar 09 '19 at 04:25
  • Homebrew? Which version did it give you? (*`'pandoc --version'`*) – Kurt Pfeifle Mar 09 '19 at 17:52
  • @jtlz2: So you liked my answer good enough to add a comment to it... but no *upvote*? – Kurt Pfeifle Apr 08 '19 at 13:04