I've got a pyPdf application that combines a bunch of PDFs into one PDF and builds a table of contents from external metadata. It works really well for some PDFs, but for others it seems to hang and never finishes writing the output. I copied the write operation into a test library to see where it was hanging, and it appears to be stuck in the method '_sweepIndirectReferences' (Line 311 here). I can set it running, come back 15-20 minutes later, and set a breakpoint to find that it's still resolving indirect references on the first page, with a stack 25-30 frames deep. If I use Acrobat to combine the files, it finishes all 200+ pages in under a minute.
I don't need my write operation to be THAT fast, but is there something I can do to speed up the process? If Adobe can do it in under a minute, I should be able to do it in less than 4 hours! I should note that it only happens with some files and not others. My guess is that how heavily a PDF relies on indirect references makes the difference.
For reference, I'm generating the PDF like this:
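from pyPdf import PdfFileWriter, PdfFileReader

opened_pdfs = []  # keep the readers around until the output has been written
o_pdf = PdfFileWriter()
for fname in list_of_pdfs:
    i_pdf = PdfFileReader(open(fname, 'rb'))
    opened_pdfs.append(i_pdf)
    for page in i_pdf.pages:
        o_pdf.addPage(page)

of = open(file_name, 'wb')
o_pdf.write(of)  # this is the call that hangs
of.close()

# close the input streams only after writing
for pdf in opened_pdfs:
    pdf.stream.close()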
This leaves out the bookmark handling, but I don't think that matters: the files with problems don't have more bookmarks than the others or anything like that.
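One rough way to test my guess about indirect references would be to count the "N G R" tokens in the raw bytes of a fast file and a slow file and compare them. The sketch below is only a heuristic and does not use pyPdf at all: the helper names `count_indirect_refs` and `count_indirect_refs_in_file` are made up for illustration, references stored inside compressed streams will not be seen, and binary stream data could occasionally produce a false match.

```python
import re

# Matches uncompressed indirect references such as "12 0 R".
# This is a crude byte-level pattern, not a real PDF parser.
_INDIRECT_REF = re.compile(br'\b\d+\s+\d+\s+R\b')


def count_indirect_refs(data):
    """Return a rough count of 'N G R' tokens in raw PDF bytes."""
    return len(_INDIRECT_REF.findall(data))


def count_indirect_refs_in_file(path):
    """Read a file as bytes and count its visible indirect references."""
    with open(path, 'rb') as handle:
        return count_indirect_refs(handle.read())
```

Calling `count_indirect_refs_in_file` on one file that writes quickly and one that hangs would give a crude comparison. If the counts are similar, the number of references alone probably isn't the explanation.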