This question is about PDF bookmarks.
When bookmarks are created, there is an option to assign a destination page layout (among other things), which users are encouraged not to set unless there is a real reason to do so.
From time to time I run into documents like this and want to remove the property for good, while keeping the bookmarks, of course.
I assume you understand what I am asking, but if not, here is the problem put differently:
My preference is to set my PDF reader (Evince) to two-page display with the best-fit page layout, which is most comfortable for me. Some PDF documents, however, set a custom layout (page width at 75%, for example) in the bookmark destination, and I have to correct my layout every time I use a bookmark to jump to a page.
An option to ignore this property in the PDF reader would be great, but there is none, so I want to process these PDF files with a command-line tool and remove all such custom properties from the PDF bookmarks.
Update:
Here is where I am now: nowhere :)
Not only did I need a tool to "correct" this problem, I also needed to know which PDF files are affected.
I used pyPdf for the job:
# chk-out.py
# Report the destination type of the first bookmark that has one (Python 2, pyPdf).
import sys
from pyPdf import PdfFileReader


def flat(iterable):
    # Recursively flatten the nested lists returned by getOutlines().
    for element in iter(iterable):
        if isinstance(element, list):
            for e in flat(element):
                yield e
        else:
            yield element


with open(sys.argv[1], 'rb') as f:
    p = PdfFileReader(f)
    try:
        for outline in flat(p.getOutlines()):
            if outline['/Type']:
                print '[%s]: "%s"' % (outline['/Type'], sys.argv[1])
                break  # one typed bookmark is enough to flag the file
    except AssertionError:
        print '[***] File "%s": Feature not supported, or corrupted PDF' % sys.argv[1]
A line like:
$ for f in *.pdf ; do python chk-out.py "$f" ; done
outputs something like this:
[/Fit]: "doc1.pdf"
[/XYZ]: "doc2.pdf"
[/Fit]: "doc3.pdf"
[/FitH]: "doc4.pdf"
...
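To narrow the list down to the affected files, the output above could be grouped by destination type. This is only a sketch (Python 3, standard library only); `group_by_type` and `LINE_RE` are hypothetical names, and the sample lines are the ones shown above rather than a real run:

```python
import re
from collections import defaultdict

# Matches one line of chk-out.py output, e.g.  [/FitH]: "doc4.pdf"
LINE_RE = re.compile(r'^\[(/\w+)\]: "(.*)"$')


def group_by_type(lines):
    """Map each destination type to the list of files reported with it."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # the '[***] ...' error lines do not match and are skipped
            dest_type, filename = match.groups()
            groups[dest_type].append(filename)
    return dict(groups)


sample = [
    '[/Fit]: "doc1.pdf"',
    '[/XYZ]: "doc2.pdf"',
    '[/Fit]: "doc3.pdf"',
    '[/FitH]: "doc4.pdf"',
]
groups = group_by_type(sample)
print(groups.get('/FitH', []))  # files whose first typed bookmark is /FitH
```

In practice the lines would come from the shell loop, for example by piping its output into a script that reads `sys.stdin`.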
The value in square brackets is the type of the destination layout.
The script is fast (a couple of documents per second) and easy to grasp, but pyPdf does not support writing PDF bookmarks.
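For reference, the PDF specification defines a destination as an array such as `[page /FitH top]` or `[page /XYZ left top zoom]`, and says that a null `left`, `top`, or `zoom` in an `/XYZ` destination leaves the viewer's current value unchanged. So the change being asked for could amount to rewriting every bookmark destination as `/XYZ` with a null zoom. The following is only a sketch of that rewrite on a plain Python list standing in for the destination array (Python 3; `neutralize` is a hypothetical helper, `None` stands in for PDF null, and no PDF library is involved). Whether Evince honors the result has not been tested here.

```python
def neutralize(dest):
    """Return a destination that keeps the viewer's current zoom.

    `dest` is modelled as a plain list: [page, type, *params].
    This list model is an assumption for illustration, not a library API.
    """
    page, dest_type, *params = dest
    if dest_type == '/XYZ':
        # Keep the target position, drop only the zoom factor.
        left, top = (list(params) + [None, None])[:2]
        return [page, '/XYZ', left, top, None]
    if dest_type in ('/FitH', '/FitBH') and params:
        # These types carry a single 'top' coordinate worth keeping.
        return [page, '/XYZ', None, params[0], None]
    # /Fit, /FitB, /FitV, /FitR, ...: jump to the page, change nothing else.
    return [page, '/XYZ', None, None, None]


for dest in ([4, '/FitH', 720], [0, '/Fit'], [2, '/XYZ', 0, 792, 0.75]):
    print(dest, '->', neutralize(dest))
```

A tool that can write outlines would still be needed to apply this to a real file; the sketch only pins down what "removing the property" would mean for each destination type.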
I thought of using pdftk for this task:
1: dump metadata and bookmarks in separate files:
pdftk doc.pdf dump_data | grep ^Info > doc.nfo
pdftk doc.pdf dump_data | grep ^Book > doc.toc
2: try to remove the bookmarks, then restore them from "doc.toc"
2a. simply try to write "doc.toc":
pdftk doc.pdf update_info doc.toc output new.pdf
- Nothing changed
2b. write the Info metadata in the hope that the bookmark outlines will be removed:
pdftk doc.pdf update_info doc.nfo output new.pdf
- It didn't happen
2c. append a "BookmarkTitle: Temp title" line to "doc.nfo" in the hope that the bookmarks will now be overwritten:
echo "BookmarkTitle: Temp title" >> doc.nfo
pdftk doc.pdf update_info doc.nfo output new.pdf
- It didn't happen
This is where I stopped
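As an aside, the two `grep` calls in step 1 can be mirrored in a few lines of Python, which also makes it easy to see which fields a bookmark record carries. This is a sketch only (Python 3); `split_dump` is a hypothetical helper and the sample text is hand-written to resemble `dump_data` output, not copied from a real dump:

```python
def split_dump(text):
    """Split `pdftk ... dump_data` output into Info and Bookmark lines."""
    lines = text.splitlines()
    info = [line for line in lines if line.startswith('Info')]
    book = [line for line in lines if line.startswith('Book')]
    return info, book


# Hand-written sample, assumed to resemble real dump_data output.
sample = """\
InfoKey: Title
InfoValue: Example document
NumberOfPages: 12
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 3
"""
info, book = split_dump(sample)
fields = sorted({line.split(':', 1)[0] for line in book})
print(fields)  # the distinct field names found in the bookmark records
```

If real dumps look like this sample, the bookmark records hold a title, a level, and a page number, with no field for the destination type, so the type could not be edited in "doc.toc" directly.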
I don't know of any other CLI tool that will let me remove bookmarks from PDF files except Ghostscript with an empty pdfmarks file, but Ghostscript takes too long to process PDF files and I want to avoid that.
Along the way I also started to suspect that this is an Evince bug. The problem above is triggered only when the bookmark destination type is set to /FitH, which I assume stands for "Fit Horizontally" rather than "Fit Height", since that is how Evince behaves.
The same files, when opened with ePDFViewer or with SumatraPDF under Wine, do not behave as they do in Evince. Maybe that is how these PDF viewers are designed, but I remember seeing the same issue with some Windows PDF reader (I can't remember which).
BTW, I'm on Ubuntu 11.04 with Evince 2.32.0