pyPdf: Speeding up the write / combine operation?

Question

I've got a pyPdf application combining a bunch of PDFs into one PDF and properly building a table of contents using external metadata. It works really well for some PDFs, but for others, it just seems to hang and never actually write the PDFs. I copied the write operation over into a test library to see where it was hanging and it seems to be hanging in the method '_sweepIndirectReferences' (Line 311 here). I can set it running, come back 15-20 minutes later and set a breakpoint to find that it's still resolving indirect references on the first page, with a stack 25-30 deep. If I use Acrobat to combine the files, it finishes all 200+ pages in under a minute.

I don't need my write operation to be THAT fast, but is there something I can do to speed up the process? It seems like something Adobe can do in under a minute is something I should be able to do in less than 4 hours! I should note that it only happens with some files, not others. My guess is that how heavily a PDF relies on indirect references makes the difference.
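Rather than breaking in with a debugger after 15-20 minutes, one option is to let the standard-library profiler report where the time goes. The sketch below is written for Python 3 and uses only cProfile and pstats; profile_call and busy_work are hypothetical names made up for this illustration, and busy_work merely stands in for the slow write call. The report is only produced once the profiled call returns, so this would have to be tried on a subset of the problem files that finishes in reasonable time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the ten
    entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


def busy_work(n):
    """Hypothetical stand-in for the slow call (e.g. the PDF write)."""
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    _, report = profile_call(busy_work, 100000)
    print(report)
```

With the code from the question, the equivalent call would presumably be profile_call(o_pdf.write, of), and a function such as _sweepIndirectReferences dominating the cumulative column would confirm what the breakpoint suggested.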

For reference, I'm generating the pdf like this:

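from pyPdf import PdfFileReader, PdfFileWriter

opened_pdfs = []
o_pdf = PdfFileWriter()

for fname in list_of_pdfs:
    # Keep each reader in a list: the source files must stay open
    # until write() has finished.
    i_pdf = PdfFileReader(open(fname, 'rb'))
    opened_pdfs.append(i_pdf)

    for page in i_pdf.pages:
        o_pdf.addPage(page)

of = open(file_name, 'wb')
o_pdf.write(of)
of.close()

# Close the source files now that the output has been written.
for pdf in opened_pdfs:
    pdf.stream.close()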

This ignores the part about the bookmarks, but I think that's likely fine. The ones with problems don't have more bookmarks or anything.

score 3 · Answer 1 · answered Nov 25 '12 at 00:23

I do not have an answer, but I might have a workaround: break the job up into segments and then combine the segments. That worked for my problem, which could be the same as yours; I did not debug it enough to find out. Also, you might look at PyPDF2, which claims to be a superset of pyPdf, and see whether they changed the bit of code where you see it getting stuck.

I used pyPdf to write a one-time script to stitch together about 160 single-page PDFs created by a dear octogenarian who put each page of his memoir in a separate file.

The memoir is about 50% pictures, and the file sizes of the pdfs range from 73kB to 2.5MB. The crux of the pypdf code is pretty much straight from the documentation:

for pdf_in in pdf_list:
    try:
        pdf = PdfFileReader(open(pdf_in, "rb"))
    except IOError:
        # Unreadable or missing input: report it and move on.
        print "skipping ", pdf_in
        continue
    num_pages = pdf.getNumPages()
    if list_only:
        # Dry run: only report the page count of each file.
        print pdf_in, ':', num_pages
    else:
        for i in range(num_pages):
            output.addPage(pdf.getPage(i))
        output.write(outputStream)
    total_pages += num_pages

When there were slightly fewer files, I successfully ran the script and it may have taken hours. It produced a 5GB pdf!
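One possible contributor to the multi-gigabyte output, offered as an assumption and not something verified against the real script: in the snippet as posted, output.write(outputStream) sits inside the per-file branch, so it would run once per input file. If outputStream is the same open file every time, each call would append another full copy of everything added so far. The toy model below (Python 3, standard library only) shows how quickly that adds up. ToyWriter and merged_size are made-up names for this illustration and have nothing to do with pyPdf's internals.

```python
import io


class ToyWriter(object):
    """Made-up stand-in for a PDF writer: write() emits every page so far."""

    def __init__(self):
        self.pages = []

    def add_page(self, data):
        self.pages.append(data)

    def write(self, stream):
        for data in self.pages:
            stream.write(data)


def merged_size(pages, write_every_iteration):
    """Return the number of bytes that end up in a single output stream."""
    writer = ToyWriter()
    stream = io.BytesIO()
    for data in pages:
        writer.add_page(data)
        if write_every_iteration:
            # Mirrors calling write() inside the per-file loop.
            writer.write(stream)
    if not write_every_iteration:
        # Mirrors calling write() once, after all pages were added.
        writer.write(stream)
    return len(stream.getvalue())


if __name__ == "__main__":
    pages = [b"x" * 100] * 160  # 160 toy "pages" of 100 bytes each
    print(merged_size(pages, False), merged_size(pages, True))
```

In this toy setup, writing once gives 16,000 bytes while writing on every iteration gives 1,288,000 bytes, about 80 times more. If the same thing is happening in the real script, moving the write call out of the loop would be worth trying before anything else.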

This weekend I updated a few files (author corrections) and tried to run it again. Coincidentally (?), my MacBook Pro froze up, and after I rebooted I had an incomplete 2.9GB pdf.

So I added this code and ran it with seglen=35, that is, 35 files at a time.

if seglen:
    # Number of segments, rounded up so a final partial segment is included.
    segments = (len(pdf_list) + seglen - 1) // seglen
    seglist = []
    for i in range(segments):
        outfile = kwargs['output_file'] + str(i)
        seglist.append(outfile + '.pdf')
        merge_files_in_order(pdf_list[i*seglen:(i+1)*seglen], kwargs['list_only'], outfile)
    # now stitch the segments together
    merge_files_in_order(seglist, kwargs['list_only'], kwargs['output_file'])
else:
    merge_files_in_order(pdf_list, kwargs['list_only'], kwargs['output_file'])

This ran in much less time and, curiously, produced a 288MB file that is complete instead of a 2.9GB file that is incomplete (or a 5GB file like the one I created a month or so ago).

Also fun: I don't clean up the "segment" files, so I can look at them as well. They range in size from 195MB to 416MB, and yet when all five were combined at the end, the resulting file was complete and only 288MB! I'm very happy.
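For anyone who wants to reuse the segmenting idea, here is a minimal, self-contained sketch of it in Python 3. It is not the script described above: split_into_segments and merge_in_segments are hypothetical helpers, and the merge argument is a caller-supplied function standing in for merge_files_in_order, assumed here to take a list of input paths and a full output path.

```python
import math


def split_into_segments(items, seglen):
    """Return consecutive slices of items, each at most seglen long."""
    if seglen <= 0:
        raise ValueError("seglen must be positive")
    count = math.ceil(len(items) / seglen)
    return [items[i * seglen:(i + 1) * seglen] for i in range(count)]


def merge_in_segments(pdf_list, seglen, output_file, merge):
    """Merge pdf_list in two passes and return the segment file names.

    merge(inputs, output_path) is a hypothetical callable that writes the
    combined inputs to output_path.
    """
    segment_paths = []
    for i, segment in enumerate(split_into_segments(pdf_list, seglen)):
        path = "%s%d.pdf" % (output_file, i)
        merge(segment, path)  # first pass: one file per segment
        segment_paths.append(path)
    # Second pass: stitch the segment files into the final document.
    merge(segment_paths, output_file + ".pdf")
    return segment_paths
```

Rounding the segment count up avoids creating an extra, empty segment when the number of files is an exact multiple of seglen. The segment files are left on disk, as in the description above.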
