
I have a script that combines a bunch of PDFs into a single file using PyPDF2. It works, but it is really slow on the company network. I then tried PyMuPDF, which is about 100 times faster, but bookmarks and metadata are not copied automatically. Is there an argument I can pass to say "while you are copying, also don't forget the bookmarks and metadata, buddy"?

A bit of code here:

def pdfMerge(try_again):
    start = time.time()
    result = fitz.open()
    for pdf in sorted_list:
        print(pdf)
        with fitz.open(pdf) as file_temp:
            result.insert_pdf(file_temp)
    if try_again == 0:
        formatted_name = f"{job_number}-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-Combined Set-{date2}.pdf"
    result.save(formatted_name)
    end = time.time()
    print(end - start)
    return formatted_name

I am also open to other options such as pikepdf (which seems better supported).

Thanks!

EDIT: I changed the code:

def pdfMerge(try_again):
    start = time.time()
    toc = []
    result = fitz.open()
    for pdf in sorted_list:
        print(pdf)
        with fitz.open(pdf) as file_temp:
            bookmarks = file_temp.get_toc()
            file_temp.set_toc(bookmarks)
            result.insert_pdf(file_temp)
            print(bookmarks)
            bookmarks = ''
    if try_again == 0:
        formatted_name = f"{job_number}-RGB-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-RGB-Combined Set-{date2}.pdf"
    result.save(formatted_name)
    end = time.time()
    print(end - start)
    return formatted_name

The print(bookmarks) output shows exactly what I need, but the combined PDF still has no bookmarks. What am I doing wrong?

EDIT 2: Here is my new function:

def pdfMerge(try_again):
    start = time.time()
    toc = []
    result = fitz.open()
    bookmarks_list = []
    for pdf in sorted_list:
        with fitz.open(pdf) as file_temp:
            bookmarks = file_temp.get_toc()
            print(bookmarks)
            bookmarks_list.append(bookmarks)
            result.insert_pdf(file_temp)
    if try_again == 0:
        formatted_name = f"{job_number}-RGB-Combined Set-{date}.pdf"
    else:
        formatted_name = f"{job_number}-RGB-Combined Set-{date2}.pdf"
    print(bookmarks_list)
    result.set_toc(bookmarks_list)
    result.save(formatted_name)
    end = time.time()
    print(end - start)
    return formatted_name

Which gives me this error:

  File "C:\Users\Sav...\Coding_Python\PdfMerge\RBGPdfMerge.0.11.10.py", line 112, in <module>
    pdfMerge(try_again)
  File "C:\Users\Sav...\Coding_Python\PdfMerge\RBGPdfMerge.0.11.10.py", line 88, in pdfMerge
    result.set_toc(bookmarks_list)
  File "C:\Users\Sav...\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\fitz\utils.py", line 1325, in set_toc
    raise ValueError("hierarchy level of item 0 must be 1")
ValueError: hierarchy level of item 0 must be 1

The same files merge perfectly with pypdf and PyPDF2.

  • PyMuPDF refers to bookmarks as [outlines](https://pymupdf.readthedocs.io/en/latest/document.html#Document.outline). They make up the table of contents, so if it is not set automatically you can try the `set_toc()` method, shown in the documentation [here](https://pymupdf.readthedocs.io/en/latest/document.html#Document.set_toc). It may also be worth asking on their Discord server (top right of the documentation site) if you need more information. – Ping34 Jan 17 '23 at 05:22
  • Hi, thanks for your answer and sorry it took so long. I tried the `set_toc()`, see my EDIT in the post. – Saverio Vasapollo Mar 18 '23 at 12:38

1 Answer


As for the metadata:

It remains unchanged: the output keeps the metadata of the PDF into which you are merging pages from the other files.
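If the combined file should carry the metadata of one of the sources instead, it can be set explicitly once merging is done. The following is only a sketch: `pick_metadata` and `copy_metadata` are hypothetical helper names, and the list of keys is an assumption about which entries of `Document.metadata` are worth copying.

```python
# Keys from Document.metadata that this sketch copies (an assumed subset).
WRITABLE_KEYS = (
    "title", "author", "subject", "keywords",
    "creator", "producer", "creationDate", "modDate",
)


def pick_metadata(source_metadata):
    """Return only the non-empty entries we want to carry over."""
    source_metadata = source_metadata or {}
    return {
        key: value
        for key, value in source_metadata.items()
        if key in WRITABLE_KEYS and value
    }


def copy_metadata(result, source_path):
    """Copy metadata from the PDF at source_path into the open document result."""
    import fitz  # PyMuPDF; imported here so pick_metadata has no dependency

    with fitz.open(source_path) as src:
        result.set_metadata(pick_metadata(src.metadata))
```

In the question's function this could be called as `copy_metadata(result, sorted_list[0])` just before `result.save(...)`, assuming the first file's metadata is the one to keep.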

PyMuPDF lets you treat bookmarks as a table of contents (TOC), much like the one in a normal book: the bookmark items simply follow each other, and each has a level, a title and a page, plus optionally some detail on exactly where on the target page it points.

So when you append a PDF to another one, you can simply append its TOC to the TOC of the target PDF as well. All you must do is increase its page numbers by the number of pages the target had before the append.

When you are done appending files, set the resulting TOC (a simple Python list) as the table of contents of the resulting file.

Here is an example taken directly from the PyMuPDF documentation:

>>> import fitz  # PyMuPDF
>>> doc1 = fitz.open("file1.pdf")
>>> doc2 = fitz.open("file2.pdf")

>>> pages1 = len(doc1)  # save doc1's page count
>>> toc1 = doc1.get_toc(False)  # save TOC 1
>>> toc2 = doc2.get_toc(False)  # save TOC 2
>>> doc1.insert_pdf(doc2)  # doc2 at end of doc1
>>> for t in toc2:  # increase toc2 page numbers
...     t[2] += pages1  # by old len(doc1)
>>> doc1.set_toc(toc1 + toc2)  # now result has total TOC
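
Applied to a whole list of files, the same idea might look like the sketch below. `shift_toc` and `merge_with_toc` are hypothetical helper names, not PyMuPDF functions. Note that `set_toc()` expects one flat list of `[level, title, page]` entries, so the per-file lists are joined with `extend()`. Appending each file's TOC as a nested list, as in EDIT 2 of the question, is the likely cause of the "hierarchy level of item 0 must be 1" error. Entries whose page number is not positive are left unshifted, on the assumption that they mark items without a target page.

```python
def shift_toc(toc, page_offset):
    """Return a copy of toc with every positive page number moved by page_offset."""
    shifted = []
    for entry in toc:
        entry = list(entry)  # copy, so the caller's list is not modified
        if entry[2] > 0:  # assumption: non-positive pages have no target
            entry[2] += page_offset
        shifted.append(entry)
    return shifted


def merge_with_toc(paths, output_path):
    """Merge the PDFs in paths into output_path and rebuild a combined TOC."""
    import fitz  # PyMuPDF; imported here so shift_toc has no dependency

    result = fitz.open()  # new, empty PDF
    combined_toc = []
    for path in paths:
        with fitz.open(path) as src:
            offset = len(result)  # pages already in the output
            combined_toc.extend(shift_toc(src.get_toc(), offset))
            result.insert_pdf(src)
    result.set_toc(combined_toc)  # one flat list, set after all inserts
    result.save(output_path)
    result.close()
    return combined_toc
```

With the names from the question, the call would be roughly `merge_with_toc(sorted_list, formatted_name)`.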
Jorj McKie
  • Hi, thanks for your answer and sorry it took so long. I tried the `set_toc()`, see my EDIT in the post. – Saverio Vasapollo Mar 18 '23 at 12:39
  • @SaverioVasapollo - no problem. Just to reiterate: `.insert_pdf()` does not copy the bookmarks, because it cannot know how to do it for partial page range copying. So you must have your final output file ready before you can set its (new or updated) TOC. – Jorj McKie Mar 19 '23 at 13:56
  • I see, I need to create the combined set first then add the bookmarks. I will give it a try, thank you. – Saverio Vasapollo Mar 20 '23 at 00:07
  • Hi Jorj, FYI I added EDIT 2 with the new function. Thanks. – Saverio Vasapollo Mar 22 '23 at 22:57