Using a text editor to remove PDF metadata
If it is the first time you edit a PDF, make a backup copy first.
Open your PDF with a text editor that can handle binary blobs. vim -b
will be fine.
Locate the /Info
dictionary. Overwrite all the entries you do not want any more completely with blanks (an entry consists of /Key
names plus the (some values)
following them).
Be careful to not use more spaces than there were characters initially. Otherwise your xref
table (ToC of PDF objects will be invalidated, and some viewers will indicate the PDF as corrupted).
For additional measure, locate the /XML
string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have them). Locate all the key values (not the <something keys>
!) in there which you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer, nor shorter).
In case your PDF does not make the /Info
dictionary accessible, transform it with the help of qpdf
.
Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
Apply the procedure outlined above. (The qdf---orig.pdf
now should be much better suited for
Re-compact your edited file:
qpdf qdf---orig.pdf edited---orig.pdf
Done! Enjoy your edited---orig.pdf
. Check if it has all the data removed:
pdfinfo -meta edited---orig.pdf
Update
After looking at the sample PDF files provided, it became clear to me that the /PageLabel
key is not part of the /Info
dictionary (PDF's Document Information Dictionary), but of the /Root
object.
That's probably one reason why pdftk
was unable to update it with the method the OP described.
The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.
In the case of the OP's example files, the complication comes into play: the /Root
object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm
). That means one has to decompress it with the help of qpdf
first:
Use qpdf
:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
Open the resulting file in binary mode with vim
:
vim -b q-notes.pdf
Locate the 1 0 obj
marker for the beginning of the /Root
object, containing a dictionary named /PageLabels
.
(a) To disable page labels altogether, just replace the /PageLabels
string by /Pagelabels
, using a lowercase 'l' (PDF is case sensitive, and will no longer recognize the keyword; you yourself could at some other time restore the original version should you need it.)
(b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as
<feff0031>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
(These values are in BOM-marked hex, meaning 1, 2, 2, 2, 3, 4...)
Edit these values to read:
<feff0031>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
[....]
<feff0035>
[....]
<feff0036>
Save the file and run qpdf
again in order to re-compress the PDF:
qpdf q-notes.pdf notes.pdf
These now hopefully are the page labels the OP is looking for....
Since the OP seems to be familiar with editing pdftk
's output of dump_data
output, he can possibly edit the output and use update_data
to apply the fix to the PDF without needing to resort to qpdf
and vim
.
Update 2:
User @Iserni posted a very good, short and working answer, which limits itself to one command, pdftk
, which the OP seems to be familiar with already, plus sed
-- not needing to use a text editor to open the PDF, and not introducing an additional utility qpdf
like my answer did.
Unfortunately @Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty and I call you to vote to "undelete" his answer!
So temporarily, I'll include a copy of @Iserni's answer here, until his is undeleted again:
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf