Python/PyPDF4: How do I specify the /PageLabels in the created PDF?

Question

I am using PyPDF4 to create an offline-readable version of the journal "Nature".

I use PyPDF4's PdfFileReader to read the individual article PDFs and PdfFileWriter to create a single, merged output.

The problem that I am trying to solve is that the page numbers of some issues do not start at 1, for example, issue 7805 starts with page 563.

How do I specify the desired /PageLabels in the document catalog?

    for pdf_file in pdf_files:
        input_pdf = PdfFileReader(open(pdf_file, 'rb'))
        page_indices = file_page_dictionary[pdf_file]
        for page_index in page_indices:
            page = input_pdf.getPage(page_index)

            # Specify actual page number here:
            # page.setPageNumber(actual_page_numbers[page_index])

            output.addPage(page)

    with open(pdf_output_name, 'wb') as f:
        output.write(f)

Use [`pypdf`](https://pypi.org/project/pypdf/) instead of PyPDF2/PyPDF3/PyPDF4. I am the maintainer of pypdf and PyPDF2. We improved pypdf a lot in 2022. — Martin Thoma, Dec 26 '22 at 08:20
https://pypdf.readthedocs.io/en/latest/modules/PdfWriter.html#pypdf.PdfWriter.set_page_label — Martin Thoma, Feb 11 '23 at 07:52
https://pypdf.readthedocs.io/en/latest/modules/PdfReader.html#pypdf.PdfReader.page_labels — Martin Thoma, Feb 11 '23 at 07:53

score 3 · Accepted Answer · answered May 15 '20 at 17:53

After exploring the PDF standard and a bit of hacking, I found that the following function adds a single PageLabels entry that creates page labels starting from offset (i.e. the first page is labelled offset, the second page offset+1, etc.).

# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF

# output_pdf is an instance of PdfFileWriter().
# offset is the desired page offset, i.e. the label of the first page.
def add_pagelabels(output_pdf, offset):
    # Label dictionary: decimal style (/D), numbering starts at offset (/St).
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"): PDF.NameObject("/D")})
    number_type.update({PDF.NameObject("/St"): PDF.NumberObject(offset)})

    # /Nums holds pairs of (physical page index, label dictionary).
    nums_array = PDF.ArrayObject()
    nums_array.append(PDF.NumberObject(0))  # physical page index
    nums_array.append(number_type)

    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})

    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})

    # Attach /PageLabels to the document catalog (a private attribute of the writer).
    root_obj = output_pdf._root_object
    root_obj.update(page_labels)

Additional page label entries can be created, e.g. with different offsets or different numbering styles.

Note that the first PDF page has an index of 0.

# Use PyPDF4 to read and write pages
from PyPDF4 import PdfFileWriter, PdfFileReader

# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF

def pdf_pagelabels_roman():
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"):PDF.NameObject("/r")})
    return number_type

def pdf_pagelabels_decimal():
    number_type = PDF.DictionaryObject()
    number_type.update({PDF.NameObject("/S"):PDF.NameObject("/D")})
    return number_type

def pdf_pagelabels_decimal_with_offset(offset):
    number_type = pdf_pagelabels_decimal()
    number_type.update({PDF.NameObject("/St"):PDF.NumberObject(offset)})
    return number_type

...
    nums_array = PDF.ArrayObject()
    # /Nums is a flat array of pairs: a physical page index (0-based)
    # followed by the label dictionary that applies from that page on.
    nums_array.append(PDF.NumberObject(0))  # Index 0 (first page):
    nums_array.append(pdf_pagelabels_roman())  # lowercase Roman numerals

    nums_array.append(PDF.NumberObject(1))  # Indices 1 to 9 (pages 2 to 10):
    nums_array.append(pdf_pagelabels_decimal_with_offset(first_offset))  # decimal, starting at first_offset

    nums_array.append(PDF.NumberObject(10))  # Index 10 onward (page 11 and later):
    nums_array.append(pdf_pagelabels_decimal_with_offset(second_offset))  # decimal, starting at second_offset


    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"):nums_array})

    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})

    root_obj = output._root_object
    root_obj.update(page_labels)
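
To check which labels a viewer should display for a set of entries like the three above, the lookup can be simulated in plain Python. This is only a sketch of how I read the page label rules: `resolve_labels` and `to_roman` are hypothetical helper names, not PyPDF4 functions, and 563 and 601 are placeholder values standing in for `first_offset` and `second_offset`.

```python
def to_roman(number, upper=False):
    """Convert a positive integer to a Roman numeral (lowercase by default)."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in pairs:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    text = "".join(parts)
    return text.upper() if upper else text


def resolve_labels(entries, page_count):
    """Return the label of every page as a list of strings.

    entries: list of (start_index, style, first_number) tuples sorted by
    start_index, mirroring the /Nums pairs. first_number plays the role
    of /St, which defaults to 1 when omitted.
    """
    labels = []
    for position, (start, style, first_number) in enumerate(entries):
        # A range runs until the next entry starts, or to the last page.
        if position + 1 < len(entries):
            end = min(entries[position + 1][0], page_count)
        else:
            end = page_count
        for page_index in range(start, end):
            number = first_number + (page_index - start)
            if style == "/D":
                labels.append(str(number))
            elif style == "/r":
                labels.append(to_roman(number))
            elif style == "/R":
                labels.append(to_roman(number, upper=True))
            else:
                raise ValueError(f"unsupported style: {style}")
    return labels


entries = [
    (0, "/r", 1),     # index 0: lowercase Roman, /St omitted so it starts at 1
    (1, "/D", 563),   # indices 1 to 9: decimal, hypothetical first_offset
    (10, "/D", 601),  # index 10 onward: decimal, hypothetical second_offset
]
labels = resolve_labels(entries, page_count=12)
print(labels)
```

With these placeholder values the first page is labelled i, the next nine pages 563 to 571, and the last two 601 and 602.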

I have been looking for something like this. When I use your code though, I'm unable to write the pdf in the final step. What was the code you used to actually write the file? I get the error `AttributeError: 'int' object has no attribute 'writeToStream'` — zoneparser, May 20 '20 at 15:40
The problem I'm trying to solve is different: retain bookmarks and pagelabels when combining several documents. Query here: https://stackoverflow.com/questions/61740267/merging-pdfs-while-retaining-custom-page-numbers-aka-pagelabels-and-bookmarks — zoneparser, May 20 '20 at 16:26
It looks like you are using a naked number. All numbers in the PageLabels structure must be wrapped with `PDF.NumberObject( xxx )`. — KevinM, May 21 '20 at 09:41
I posted an answer in https://stackoverflow.com/questions/61740267/merging-pdfs-while-retaining-custom-page-numbers-aka-pagelabels-and-bookmarks/61954278#61954278 — KevinM, May 22 '20 at 11:47
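
For anyone hitting the same `writeToStream` error: one way to locate the naked value is to walk the structure before writing and report everything that lacks a `writeToStream` method. The sketch below is an assumption-laden debugging aid, not part of PyPDF4: `find_unwrapped` is a hypothetical helper, and the `Stub...` classes only stand in for `PDF.NumberObject`, `PDF.NameObject`, `PDF.ArrayObject` and `PDF.DictionaryObject` so that the example runs without the library.

```python
def find_unwrapped(obj, path="root"):
    """Return descriptions of all values that lack a writeToStream method."""
    problems = []
    if not hasattr(obj, "writeToStream"):
        problems.append(f"{path}: {type(obj).__name__} {obj!r}")
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Keys must be wrapped too (as name objects).
            problems.extend(find_unwrapped(key, f"{path} key {key!r}"))
            problems.extend(find_unwrapped(value, f"{path}[{key!r}]"))
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            problems.extend(find_unwrapped(item, f"{path}[{index}]"))
    return problems


# Stand-ins for the PyPDF4 object types (hypothetical, for this demo only).
class _Stub:
    def writeToStream(self, stream, encryption_key):
        pass

class StubNumber(int, _Stub): pass
class StubName(str, _Stub): pass
class StubArray(list, _Stub): pass
class StubDict(dict, _Stub): pass


label = StubDict({StubName("/S"): StubName("/D"), StubName("/St"): StubNumber(563)})

# Correct: the page index is wrapped.
good = StubDict({StubName("/Nums"): StubArray([StubNumber(0), label])})

# Broken: the page index is a plain int, which triggers the AttributeError.
bad = StubDict({StubName("/Nums"): StubArray([0, label])})

print(find_unwrapped(good))
print(find_unwrapped(bad))
```

With real PyPDF4 objects, I would call it on the page label dictionary itself (e.g. `page_numbers` in the answer) rather than on the whole catalog.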
