I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).

If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?

Specifically, I'm asking because I've previously looked into writing a text extraction library, but in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work just to avoid server-side processing in this case.

To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).

Brandon

1 Answer

Even to perform a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the things you would need to handle include:

  • basic parsing of PDF syntax
  • indexing via the cross-reference table, and/or cross-reference streams and object streams
  • objects (numbers, byte strings, hex strings, dictionaries, arrays, booleans...)
  • filters and variants (LZW, Flate, RunLength, Predictors)
  • encryption (RC4, AES, Custom security handlers)
  • page tree traversal
  • basic handling of page content streams
  • image handling
  • serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
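
To give a feel for the indexing and incremental-update items, here is a minimal sketch of the very first step of reading a PDF: finding the byte offset stored after the last `startxref` keyword, which points at the most recent cross-reference section (an incremental update also has to record this offset as `/Prev` in its new trailer). The helper name `findStartXref`, the 1024-byte tail window and the hand-made sample bytes are assumptions for illustration, not part of any library.

```javascript
// Hypothetical helper: returns the byte offset recorded after the last
// `startxref` keyword, i.e. where the most recent cross-reference section begins.
function findStartXref(pdfBytes) {
  // The keyword sits near the end of the file, so only the tail is scanned.
  // The 1024-byte window is an assumption chosen for this sketch.
  const tailStart = Math.max(0, pdfBytes.length - 1024);
  const tail = pdfBytes.toString('latin1', tailStart);
  const keywordAt = tail.lastIndexOf('startxref');
  if (keywordAt === -1) {
    throw new Error('startxref keyword not found');
  }
  const match = /^startxref\s+(\d+)/.exec(tail.slice(keywordAt));
  if (!match) {
    throw new Error('startxref is not followed by a byte offset');
  }
  return Number(match[1]);
}

// Tiny hand-made file tail, for illustration only (not a complete PDF).
// The `xref` keyword starts at byte 31 of this buffer.
const sample = Buffer.from(
  '%PDF-1.4\n...objects omitted...\nxref\n0 1\n0000000000 65535 f \n' +
  'trailer\n<< /Size 1 >>\nstartxref\n31\n%%EOF\n',
  'latin1'
);
console.log(findStartXref(sample));
```

This only locates the index. Following it to the page tree still requires the object parsing, filter and (possibly) encryption handling listed above.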

Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.

dwarring
  • See also the answer to this question: http://stackoverflow.com/questions/34361609/is-it-possible-to-reinitialize-the-graphics-state-in-a-pdf-file. If you want to position an image absolutely on a page, it can be a good idea to first wrap the existing page content in `q` ... `Q` (save/restore) to restore the page's graphics state to its initial state. Some high-level toolkits may do this for you. – dwarring Dec 20 '15 at 22:29
  • It looks like it may be a lot simpler than that. My research indicates that you can simply append new content to the end of a PDF without altering anything in the original. See http://blog.didierstevens.com/2008/05/07/solving-a-little-pdf-puzzle/ and https://blog.idrsolutions.com/2012/11/understanding-the-pdf-file-format-multiple-trailers-on-a-pdf-file/ for more info. Still experimenting with this. – Brandon Dec 21 '15 at 01:29
  • Agreed, an incremental update is probably the easiest way to write changes to the PDF. But reading the PDF is the hardest part. I stand by my answer, including: parsing, objects, filters, possible encryption and page tree traversal, and some serialization. – dwarring Dec 21 '15 at 02:15
  • Here's another link from the idrsolutions blog that you linked to. This one's about parsing PDFs, which is just part of the problem: https://blog.idrsolutions.com/2011/07/why-writing-a-pdf-parser-is-such-a-challenging-task-part-234/ – dwarring Dec 21 '15 at 02:49
  • All my PDFs will be coming from one source (wkhtmltopdf), so theoretically I can take a bunch of shortcuts and implement only part of the PDF spec (enough to insert my images). Not ideal, but better than server-side processing or spending a lot of time writing a standards-compliant parser. – Brandon Dec 21 '15 at 03:09
  • 1
    OK, you definitely need to wrap wkhtmltopdf generated content in `q` ... `Q` before appending your image. See also this answer - http://stackoverflow.com/questions/25524492/manipulating-a-pdf-file-with-different-rotations-and-scaling-with-perls-pdfap/29067780#29067780 – dwarring Dec 21 '15 at 03:34
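
The `q` ... `Q` advice in the comments above can be sketched roughly as follows. This assumes the page content has already been located and decoded into operator text, which is the hard part discussed in the answer. The helper names, the resource name `Im1` and the sample operators are hypothetical, and the image itself would still have to be added as an XObject under the page's `/Resources`.

```javascript
// Hypothetical helper: builds the operators that draw an image XObject at an
// absolute position (PDF user-space units, origin at the bottom-left corner).
function imageDrawOperators(resourceName, x, y, width, height) {
  // An image occupies the unit square, so `cm` scales it to width x height
  // and translates it to (x, y) before `Do` paints it.
  return `q ${width} 0 0 ${height} ${x} ${y} cm /${resourceName} Do Q\n`;
}

// Hypothetical helper: wraps already-decoded page content in q ... Q so that
// any transforms it applied are undone before the extra operators run.
function wrapAndAppend(existingContent, extraOperators) {
  return `q\n${existingContent}\nQ\n${extraOperators}`;
}

// Made-up existing content that leaves a 0.5 scale in effect when it ends.
const existing = '0.5 0 0 0.5 0 0 cm\nBT /F1 12 Tf (Hello) Tj ET';
const updated = wrapAndAppend(existing, imageDrawOperators('Im1', 72, 600, 200, 100));
console.log(updated);
```

Without the outer `q` ... `Q`, the image in this example would be drawn at half size and at half the intended offset, because the earlier `cm` would still be in effect. Since a page's `/Contents` may be an array of streams, one way to avoid decoding the original content at all is to add one new stream containing just `q` before it and another containing `Q` plus the image operators after it.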