Duplicating PDF with PyPDF2 gives blank pages

Question

I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to preserve document metadata.

PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.

If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file, with the correct number of pages, but all pages are blank.

If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.

This has been asked before, but with no successful answers. Other questions have asked about Blank pages in PyPDF2, but I can't apply the answer given.

Here's my code:

from PyPDF2 import PdfFileReader, PdfFileWriter

def bookmark(incomingFile):
    reader = PdfFileReader(incomingFile)
    writer = PdfFileWriter()

    writer.appendPagesFromReader(reader)
    #writer.cloneDocumentFromReader(reader)
    my_table_of_contents = [
            ('Page 1', 0), 
            ('Page 2', 1),
            ('Page 3', 2)
            ]
    # writer.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in my_table_of_contents:
        writer.addBookmark(title, pagenum, parent=None)

    writer.setPageMode("/UseOutlines")

    with open(incomingFile, "wb") as fp:
        writer.write(fp)

I tend to get errors when PyPDF2 cannot add a bookmark to the PdfFileWriter object, for example because the writer has no pages.
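For reference, the page-by-page copy mentioned at the start of the question can be sketched roughly as below. This is only a sketch: it assumes the legacy PyPDF2 1.x/2.x names used in the question (`PdfFileReader`, `PdfFileWriter`), the names `build_toc`, `copy_with_bookmarks` and `dst_path` are made up for illustration, and it has not been verified to avoid the blank pages described above. It copies only the plain string entries of the document information dictionary, so other document-level data may still be lost.

```python
def build_toc(num_pages):
    """Return (title, zero-based page index) pairs, one per page."""
    return [("Page %d" % (i + 1), i) for i in range(num_pages)]


def copy_with_bookmarks(src_path, dst_path):
    """Copy src_path page by page into dst_path and bookmark every page."""
    # Imported lazily so build_toc() stays usable without PyPDF2 installed.
    # Assumption: the legacy PyPDF2 1.x/2.x API, as used in the question.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(src_path)
    writer = PdfFileWriter()

    # Copy each page individually instead of cloning the whole document.
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))

    # Carry over the plain string entries of the document information dictionary.
    info = reader.getDocumentInfo()
    if info is not None:
        writer.addMetadata(
            {key: value for key, value in info.items()
             if isinstance(value, (str, bytes))}
        )

    for title, pagenum in build_toc(reader.getNumPages()):
        writer.addBookmark(title, pagenum, parent=None)
    writer.setPageMode("/UseOutlines")

    # Writing to a separate path is a choice made for this sketch.
    with open(dst_path, "wb") as fp:
        writer.write(fp)
```

Calling `copy_with_bookmarks("in.pdf", "out.pdf")` (both paths hypothetical) would leave the source file untouched, which makes it easier to compare the two files when pages come out blank.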

Don't you think it's easier to read if (1) closing file handles is avoided by using context managers / built-in methods, (2) the reader/writer variable names are used as in the PyPDF2 docs, and (3) variable names use snake_case, as PEP 8 suggests and most of the Python community does? If you don't like the change, feel free to revert. - Martin Thoma, May 01 '22 at 14:42

score 2 · Accepted Answer · answered May 06 '19 at 18:14

I also wrestled with this a lot and finally found that PyPDF2 itself has this issue. Basically, I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution), around line 382, for the cloneDocumentFromReader function.

After that I was able to append the reader pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, to update PDF Metadata (Subject, Keywords, etc.).

Hope this helps you

def cloneDocumentFromReader(self, reader, after_page_append=None):
    '''
    Create a copy (clone) of a document from a PDF file reader

    :param reader: PDF file reader instance from which the clone
        should be created.
    :callback after_page_append (function): Callback function that is invoked after
        each page is appended to the writer. Signature includes a reference to the
        appended page (delegates to appendPagesFromReader). Callback signature:

        :param writer_pageref (PDF page reference): Reference to the page just
            appended to the document.
    '''
    debug = False
    if debug:
        print("Number of Objects: %d" % len(self._objects))
        for obj in self._objects:
            print("\tObject is %r" % obj)
            if hasattr(obj, "indirectRef") and obj.indirectRef is not None:
                print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

    # Variables used after cloning the root to
    # tidy up the pre- and post-cloning state

    mustAddTogether = False
    newInfoRef = self._info
    oldPagesRef = self._pages
    oldPages = self.getObject(self._pages)

    # If there have already been any number of pages added

    if oldPages[NameObject("/Count")] > 0:

        # Keep them

        mustAddTogether = True
    else:

        # Throw the page object out

        if oldPages in self._objects:
            newInfoRef = self._pages
            self._objects.remove(oldPages)

    # Clone the reader's root document

    self.cloneReaderDocumentRoot(reader)
    if not self._root:
        self._root = self._addObject(self._root_object)

    # Sweep for all indirect references

    externalReferenceMap = {}
    self.stack = []
    newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

    # Delete the stack to reset

    del self.stack

    # Clean-up time

    # Get the new root of the PDF

    realRoot = self.getObject(newRootRef)

    # Get the new pages tree root and its ID Number

    tmpPages = realRoot[NameObject("/Pages")]
    newIdNumForPages = 1 + self._objects.index(tmpPages)

    # Make an IndirectObject just for the new Pages

    self._pages = IndirectObject(newIdNumForPages, 0, self)

    # If there are any pages to add back in

    if mustAddTogether:

        # Set the new page tree root's parent to the
        # old page tree root's reference

        tmpPages[NameObject("/Parent")] = oldPagesRef

        # Add the reference to the new page tree root to
        # the old page tree root's kids array

        newPagesRef = self._pages
        oldPages[NameObject("/Kids")].append(newPagesRef)

        # Point every page tree root reference back at
        # the old page tree root

        self._pages = oldPagesRef
        realRoot[NameObject("/Pages")] = oldPagesRef

        # Update the count attribute of the page's root

        oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])

    else:

        # Re-point the info reference, because removing the
        # old page tree shifted the object numbers

        self._info = newInfoRef
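
The `mustAddTogether` branch above is the least obvious part, so here is a small stand-alone model of it. It uses plain Python dicts instead of PyPDF2 objects (in a real PDF these entries are indirect references), and the names `merge_page_trees` and `count_leaves` are made up for this illustration. It only shows the /Parent, /Kids and /Count bookkeeping, not the reference sweeping.

```python
def merge_page_trees(old_pages, new_pages):
    """Nest new_pages under old_pages, as the mustAddTogether branch does."""
    # The cloned tree's root becomes a child of the existing root.
    new_pages["/Parent"] = old_pages
    old_pages["/Kids"].append(new_pages)
    # /Count on a page tree node is the number of leaf pages below it.
    old_pages["/Count"] += new_pages["/Count"]
    return old_pages


def count_leaves(node):
    """Count the leaf /Page nodes under node, following /Kids only."""
    if node["/Type"] == "/Page":
        return 1
    return sum(count_leaves(kid) for kid in node["/Kids"])


# A writer that already holds one page, and a cloned tree with two pages.
old_root = {"/Type": "/Pages", "/Kids": [{"/Type": "/Page"}], "/Count": 1}
new_root = {
    "/Type": "/Pages",
    "/Kids": [{"/Type": "/Page"}, {"/Type": "/Page"}],
    "/Count": 2,
}

merged = merge_page_trees(old_root, new_root)
print(merged["/Count"], count_leaves(merged))  # prints: 3 3
```

After the merge, the stored /Count agrees with the actual number of leaf pages, which is the invariant the patched function maintains when the writer already contains pages.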
