
I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels and I cannot figure how to drop them. This is what I am currently doing:

$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf

This does not remove the PageLabel* metadata which can be verified with:

$ pdftk example_new.pdf dump_data | grep PageLabel

How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do this on GNU/Linux, that would also work for me.

I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option, which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides-only PDF, it generates the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the PageLabels in the first example with those in the second. That said, since there are extra PageLabels in the first example, I would need to remove them first.

mako
  • Seeing a sample PDF of yours would help tremendously to understand what exactly is wrong/extraneous with your PageLabels, and following this, to offer a working solution... – Kurt Pfeifle Sep 14 '14 at 09:53
  • Thanks for your feedback @KurtPfeifle. I have included examples with incorrect and correct PageLabels and described how I am building these PDFs. Correct PageLabels are important because they are used as hints for the presentation software I am using. – mako Sep 14 '14 at 22:19

2 Answers


Using a text editor to remove PDF metadata

  1. If it is the first time you edit a PDF, make a backup copy first.

  2. Open your PDF with a text editor that can handle binary blobs. vim -b will be fine.

  3. Locate the /Info dictionary. Overwrite every entry you no longer want completely with blanks (an entry consists of the /Key name plus the (some value) that follows it).

  4. Be careful not to use more spaces than there were characters initially. Otherwise your xref table (the ToC of PDF objects) will be invalidated, and some viewers will report the PDF as corrupted.

  5. For additional measure, locate the /XML string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have them). Locate all the key values (not the <something keys>!) in there which you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer, nor shorter).
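The "same length" rule in steps 3 to 5 can be checked mechanically. What follows is only a minimal sketch: the helper names `pad_blank` and `same_size` are made up for illustration and are not part of any PDF tool, and it assumes you kept the backup copy from step 1 next to the edited file.

```shell
# Hypothetical helpers (illustration only, not part of pdftk or qpdf).

# Print exactly as many spaces as the argument has bytes,
# e.g. to paste over an /Info entry without shifting anything after it.
pad_blank() {
    n=$(( $(printf '%s' "$1" | wc -c) ))
    printf '%*s' "$n" ''
}

# Succeed only if both files have the same size in bytes.
# If the edited PDF differs in size from the backup, the xref offsets are off.
same_size() {
    a=$(( $(wc -c < "$1") ))
    b=$(( $(wc -c < "$2") ))
    [ "$a" -eq "$b" ]
}
```

For example, `same_size backup.pdf edited.pdf || echo "length changed"` (both file names are placeholders) flags an edit that added or dropped bytes.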

In case your PDF does not make the /Info dictionary accessible, transform it with the help of qpdf.

  1. Use this command:

    qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
    
  2. Apply the procedure outlined above. (The qdf---orig.pdf now should be much better suited for editing in a text editor.)

  3. Re-compact your edited file:

    qpdf qdf---orig.pdf  edited---orig.pdf
    
  4. Done! Enjoy your edited---orig.pdf. Check if it has all the data removed:

    pdfinfo -meta edited---orig.pdf
    

Update

After looking at the sample PDF files provided, it became clear to me that the /PageLabels key is not part of the /Info dictionary (PDF's Document Information Dictionary), but of the /Root object.

That's probably one reason why pdftk was unable to update it with the method the OP described.

The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!

 Logical Page No. |  Page Label
 -----------------+------------
               1  |   1
               2  |   2
               3  |   2
               4  |   2
               5  |   2
               6  |   4

The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:

 Logical Page No. |  Page Label
 -----------------+------------
               1  |   1
               2  |   1
               3  |   2
               4  |   2
               5  |   2
               6  |   4

My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.

In the case of the OP's example files, a complication comes into play: the /Root object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm). That means one has to decompress it with the help of qpdf first:

  1. Use qpdf:

    qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
    
  2. Open the resulting file in binary mode with vim:

    vim -b q-notes.pdf
    
  3. Locate the 1 0 obj marker for the beginning of the /Root object, containing a dictionary named /PageLabels.

    (a) To disable page labels altogether, just replace the /PageLabels string by /Pagelabels, using a lowercase 'l' (PDF is case sensitive, and will no longer recognize the keyword; you yourself could at some other time restore the original version should you need it.)

    (b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as

       <feff0031>
       [....] 
       <feff0032>
       [....] 
       <feff0032>
       [....] 
       <feff0032>
       [....] 
       <feff0033>
       [....] 
       <feff0034>
    

    (These values are in BOM-marked hex, meaning 1, 2, 2, 2, 3, 4...)

    Edit these values to read:

        <feff0031>
        [....] 
        <feff0032>
        [....] 
        <feff0033>
        [....] 
        <feff0034>
        [....] 
        <feff0035>
        [....] 
        <feff0036>
    
  4. Save the file and run qpdf again in order to re-compress the PDF:

    qpdf q-notes.pdf notes.pdf
    

    These now hopefully are the page labels the OP is looking for....
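The BOM-marked hex strings from step 3(b) are easy to get wrong by hand. As a rough sketch, the two helpers below convert between a label and its `<feff....>` form. The names `encode_label` and `decode_label` are hypothetical, and the code assumes plain ASCII labels such as "5" or "iv"; anything outside ASCII would need a real UTF-16 conversion.

```shell
# Hypothetical helpers for the <feff....> strings; ASCII labels only.

# "5" -> <feff0035>: the BOM feff, then each character as a 4-digit hex code unit.
encode_label() {
    s=$1
    out='<feff'
    while [ -n "$s" ]; do
        rest=${s#?}              # everything after the first character
        c=${s%"$rest"}           # the first character itself
        out=$out$(printf '%04x' "'$c")
        s=$rest
    done
    printf '%s>\n' "$out"
}

# <feff0034> -> "4": strip the brackets and the BOM, then decode 4 hex digits at a time.
decode_label() {
    hex=$(printf '%s' "$1" | tr -d '<>' | tr 'A-F' 'a-f')
    hex=${hex#feff}
    out=''
    while [ -n "$hex" ]; do
        if [ "${#hex}" -lt 4 ]; then break; fi   # ignore a malformed tail
        rest=${hex#????}
        unit=${hex%"$rest"}
        # Turn the code unit into an octal escape that printf can print.
        out=$out$(printf "\\$(printf '%03o' "0x$unit")")
        hex=$rest
    done
    printf '%s\n' "$out"
}
```

Keep in mind that a replacement label with a different number of characters changes the file length, which is harmless in the qdf file only because qpdf rebuilds the xref table in the final step.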

Since the OP seems to be familiar with editing the output of pdftk's dump_data, he can possibly edit that output and use update_info to apply the fix to the PDF without needing to resort to qpdf and vim.


Update 2:

User @Iserni posted a very good, short and working answer, which limits itself to one command, pdftk, which the OP seems to be familiar with already, plus sed -- not needing to use a text editor to open the PDF, and not introducing an additional utility qpdf like my answer did.

Unfortunately @Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty and I call on you to vote to "undelete" his answer!

So temporarily, I'll include a copy of @Iserni's answer here, until his is undeleted again:

Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.

# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress

# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf

# Recompress
pdftk mangled.pdf output final.pdf compress

# Remove temp file
rm -f temp.pdf mangled.pdf
Kurt Pfeifle
  • My problem is that software generating PDFs is getting the page labels incorrect. Is there any way to do this programmatically? – mako Sep 13 '14 at 19:06
  • @BenjaminMakoHill: Hmmm... so why didn't you ask ***this***? And why don't you name the 'generating software' (or at least describe some more details about its functions)? You know, some PDF-'generating software' does have switches to influence its results, some doesn't. You leave me guessing... – Kurt Pfeifle Sep 14 '14 at 09:44
  • I've updated my question to provide much more context. Thank you so much for taking the time with my question! I really appreciate it! – mako Sep 14 '14 at 22:20
  • @Iserni: I'm sorry you deleted your answer! My comment was not at all meant to cause you to do that. I honestly appreciated your short method, and I upvoted your answer, and I'd be glad to see your answer harvesting the bounty. Please un-delete your answer again! (Wait... I voted to undelete it.) – Kurt Pfeifle Sep 16 '14 at 21:23
  • I did understand and appreciate your comment (and your upvote :-) ). But I believe that *I* am in the wrong for not checking more thoroughly your answer (if I had, I would perhaps not even have answered in the first place!). Please don't worry - no harm was done, no hard feelings, and there will be other occasions anyway! – LSerni Sep 17 '14 at 06:28
  • @lserni, Can you undelete your answer? I haven't had a chance to read this closely yet but I think your answer works and is the correct one. – mako Sep 17 '14 at 22:47
  • It is the same quoted in Kurt's own answer. It worked for me on your test files - or so I think - but so should Kurt's. If the latter does not, I'm afraid mine won't either. However, here it goes. Tell us how it works... – LSerni Sep 17 '14 at 22:55
  • @KurtPfeifle. Your first table in your update is wrong but the labels *are* correct. You quote the correct labels (i.e., 1, 2, 2, 2, 3, 4) later. The PageLabel "2" is *intentionally* repeated and this is clear if you look at the page number printed on the corner of the slides. Those "repeats" of the #2 page label are actually "reveals" of additional information on the slide. [This PDF slide presentation software](https://davvil.github.io/pdfpc/) is aware of these repeats and will, for example, skip reveal sub-slides when going backwards. This is precisely the data I am trying to preserve. – mako Sep 18 '14 at 05:02
  • @BenjaminMakoHill: In this case (that the *correct* labels should be 1, 2, 2, 2, 3, 4), @lserni's answer is probably also not correct. It results in basically the same PDF as mine, albeit with "simpler" means: he (like I) just discards all page labels (which makes the labels fall back to the logical page numbers). Maybe modifying the suggested answer(s) by inserting your step of `pdftk ... update_info ...` will be able to insert correct page labels (after invalidating the existing ones, like we suggested). – Kurt Pfeifle Sep 18 '14 at 21:15
  • @KurtPfeifle: Actually, `pdftk ... update_info ...` only seems to be able to *change* data, not remove it! That said, both your answer (and lserni's related simple answer) answer the question as asked. I'll ask a new one about adding new metadata. Thank you both *so much* for spending so much effort on this question! Simply getting the incorrect data "out" is a huge improvement for me. – mako Sep 18 '14 at 21:34
  • @BenjaminMakoHill: Maybe I still do not fully understand... But what we both suggested would *remove* the data, because no compliant PDF reader will be able to read keys which have wrong or wrong-case spelling. That means, to them, it looks like the keys would not be present (effectively, be removed). OTOH, `pdftk` normally can *add* stuff to (at least) metadata (have to make some more tests with PageLabels, which I had not looked at before), not just change it. From your original question I conclude that it did ***NOT*** change the existing PageLabels. Now you say it ***can*** change data??? – Kurt Pfeifle Sep 18 '14 at 22:17
  • @KurtPfeifle: I completely understand both why and how this solution works. When I tried yesterday, I could not get `pdftk ... update_info ...` to add PageLabels fields to a PDF — even after I had "removed" the old ones with `sed` first. Does this work for you? The first sentence of my question states that I have used `pdftk` to change PDF metadata (although never PageLabels) in the past. – mako Sep 18 '14 at 22:32
  • @BenjaminMakoHill: I just checked versions 1.44 and 2.02 of `pdftk`. Both were unable to write page labels (with data dumped from your \*SLIDES.pdf) to the most simple, 6-empty-pages-PDF created by `gs -o 6p.pdf -sDEVICE=pdfwrite -c "showpage showpage showpage showpage showpage showpage"`. (1/2) – Kurt Pfeifle Sep 19 '14 at 00:30
  • @BenjaminMakoHill: (cont'd) Checking more closely the output of `pdftk --help` (both versions), there is no mention that it can update/change/insert page labels with `update_info`. It only states: "*Changes the **metadata** stored in a single PDF's Info dictionary to match input data*" -- For `dump_data` however: it "*reports various statistics, **metadata, bookmarks (a/k/a outlines), and page labels***". So probably, the observed behavior isn't a bug. We expected too much -- the feature is not supported... :-( (2/2) – Kurt Pfeifle Sep 19 '14 at 00:31
  • @BenjaminMakoHill: ...which brings us back to manually edit the desired `/PageLabels` info into the PDF file. – Kurt Pfeifle Sep 19 '14 at 00:32

Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.

# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress

# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf

# Recompress
pdftk mangled.pdf output final.pdf compress

rm -f temp.pdf mangled.pdf
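
The "keep same length" comment in the script above is easy to violate when picking a different replacement name. A small guarded variant of the `sed` step, offered only as a sketch: `mangle_key` is a made-up helper name, and it assumes the key is a plain PDF name with no regex metacharacters and that it sits at the start of a line, as it does in the uncompressed output used above.

```shell
# mangle_key FILE KEY REPLACEMENT   (hypothetical helper, illustration only)
# Writes FILE to stdout with KEY replaced at the start of a line.
# Refuses to run if KEY and REPLACEMENT differ in length.
mangle_key() {
    file=$1
    key=$2
    new=$3
    if [ "${#key}" -ne "${#new}" ]; then
        echo "mangle_key: replacement must have the same length as the key" >&2
        return 1
    fi
    # LC_ALL=C makes sed work on bytes, since the file also contains binary streams.
    LC_ALL=C sed -e "s|^$key|$new|" "$file"
}
```

With the file names from the script above, the middle step would become `mangle_key temp.pdf /PageLabels /BageLapels > mangled.pdf`.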
LSerni
  • This is a rip-off of my answer suggesting to **(1)** *Uncompress PDF*; **(2)** *Replace `/PageLabels` by `/Pagelabels`...*; **(3)** *Recompress PDF* :-) -- Admittedly, while this answer may not provide as much educational insight, it is shorter, avoids directly editing the PDF, remains in the realm of a single `pdftk` commandline utility and overall works faster. Therefore it deserves all upvotes it can get, including my own :-) – Kurt Pfeifle Sep 16 '14 at 12:48
  • I am sorry, I hadn't realized :-( -- since you have enough reputation to read comments in deleted posts, I'm saving some time and removing the answer straight away. I'd appreciate it if you included my little script in *your* answer though :-). – LSerni Sep 16 '14 at 16:29