qpdf - replace text in existing PDF file

Question

This is the first time I'm working with PDFs at this level, so please be patient with my noob question. I understand the logical and physical structure of a PDF file on a basic level.

I have a PDF that contains a dummy ID that needs to be replaced. To check whether there is a way to do this, I used qpdf to expand the PDF using

qpdf --qdf --object-streams=disable orig.pdf expanded.pdf

Using a hex editor I located the dummy ID in expanded.pdf and changed the value by simply swapping two digits

<001800180017> Tj => <001700170018> Tj

and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original ID 443 is still rendered, but searching for "443" doesn't find it. When searching for "334", the modified content, I get the rendered original ID 443 highlighted.

The PDF consists of text and vector graphics. When I insert additional digits (which obviously invalidates the offsets in the xref), I get an error message regarding a missing font and all digits are shown as dots, but the vector graphics are still in place. This seems to indicate that the ID is not part of the graphics.
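For illustration, the same-length edit described above can be scripted instead of done in a hex editor. This is only a minimal sketch: the helper name `replace_hex_operand` and the sample content stream are made up for the example, and it deliberately refuses any replacement that would change the byte length, since that would shift the offsets recorded in the xref table.

```python
def replace_hex_operand(data: bytes, old_hex: str, new_hex: str) -> bytes:
    """Swap one <hex> string operand for another of identical length.

    Hypothetical helper for illustration; it works on raw bytes and knows
    nothing about PDF structure beyond the angle-bracket hex string syntax.
    """
    old = b"<" + old_hex.encode("ascii") + b">"
    new = b"<" + new_hex.encode("ascii") + b">"
    if len(old) != len(new):
        # A length change moves every following byte and invalidates
        # the xref offsets and the stream length.
        raise ValueError("replacement must have the same length")
    if data.count(old) != 1:
        # Refuse ambiguous edits rather than guess which one is meant.
        raise ValueError("expected exactly one occurrence of the operand")
    return data.replace(old, new)


# Made-up fragment resembling the text drawing instruction in the question.
sample = b"BT\n/F1 12 Tf\n<001800180017> Tj\nET\n"
patched = replace_hex_operand(sample, "001800180017", "001700170018")
```

For edits that do change the length, qpdf ships a `fix-qdf` helper that is intended to regenerate the xref table and stream lengths of a hand-edited QDF file.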

What did I miss?

EDIT 1: After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. The stroke and non-stroke colors were both 0,0,0 in the BT/ET section.

Is this because of the embedded non-standard font that is used? Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?

Looking back, I wonder what I did to get the dots when I first modified the content. It seems impossible, and I can't reproduce it either.

Thanks Tom

Changing the **Tj** argument as you did usually suffices. Thus, there is something special about your pdf. Please share it for analysis. — mkl, Dec 11 '19 at 17:40
Thanks for your comment. I'm not allowed to share the PDF because of its content. — ths, Dec 14 '19 at 13:06
@ths the community would be better served if you accepted an answer if your issue was resolved. Please consider commenting, or accepting an answer. — FabricioG, Jul 07 '22 at 23:43

score 0 · Answer 1 · answered Dec 14 '19 at 14:47

First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.

You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.

Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.

The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands) is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.

This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.

Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.

Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
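To check the "invisible" variant of this guess, one could scan the expanded content stream for the text rendering mode in effect at each **Tj**. The following is a rough sketch under several assumptions: the function name `show_ops_with_render_mode` and the sample stream are invented for illustration, only hex string operands of **Tj** are recognized, and graphics state save/restore (q/Q) is ignored. Rendering mode 3 (set by `3 Tr`) means the text is neither filled nor stroked.

```python
import re

# Matches either "<digit> Tr" (set text rendering mode) or "<hex> Tj" (show text).
TOKEN = re.compile(rb"(?<![\d.])(\d)\s+Tr\b|<([0-9A-Fa-f\s]*)>\s*Tj\b")


def show_ops_with_render_mode(content: bytes):
    """List (hex_operand, rendering_mode) for each hex-string Tj, in stream order.

    Simplified on purpose: no q/Q handling, no TJ arrays, no literal strings.
    """
    mode = 0  # the default rendering mode is 0 (fill)
    found = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            mode = int(match.group(1))
        else:
            found.append((match.group(2).decode("ascii"), mode))
    return found


# Made-up stream: an outline drawn as a path, then invisible searchable text.
hidden = b"0 0 m 10 0 l 10 10 l f\nBT\n/F1 12 Tf\n3 Tr\n<001800180017> Tj\nET\n"
visible = b"BT /F1 12 Tf <0017> Tj ET"
hidden_ops = show_ops_with_render_mode(hidden)
visible_ops = show_ops_with_render_mode(visible)
```

If every **Tj** reports mode 0, the other possibility mentioned above (a font whose glyphs are all blank) would have to be checked in the font resources instead.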

If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
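As a starting point for finding the vector graphics that would have to be replaced, one could pair each BT/ET section with the run of path instructions directly before it. This is a minimal sketch, assuming the layout reported in the question (a series of m/l/c commands followed by a BT/ET section); the function name `outline_spans` and the sample stream are hypothetical, and the regular expressions would be fooled by the letters BT or ET appearing inside string operands.

```python
import re

TEXT_BLOCK = re.compile(rb"\bBT\b.*?\bET\b", re.S)
PATH_OP = re.compile(rb"\b[mlc]\b")  # moveto, lineto, curveto as standalone tokens


def outline_spans(content: bytes):
    """Return (start, end, path_op_count) for the bytes preceding each BT/ET block."""
    spans = []
    prev_end = 0
    for block in TEXT_BLOCK.finditer(content):
        segment = content[prev_end:block.start()]
        spans.append((prev_end, block.start(), len(PATH_OP.findall(segment))))
        prev_end = block.end()
    return spans


# Made-up stream with two outline-plus-text pairs.
stream = (
    b"0 0 m 10 0 l 10 10 l f\nBT <0018> Tj ET\n"
    b"5 5 m 6 9 7 9 8 5 c f\nBT <0017> Tj ET\n"
)
spans = outline_spans(stream)
```

Each reported span is only a candidate: whether those path instructions really draw the glyph outlines of the following text still has to be confirmed by rendering or by inspecting the coordinates.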

If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.
