I am using PdfMerger from PyPDF2. My program reads all PDFs in a folder and merges them into a single file. I tested it with 15 PDF files of 99 kB each and it worked like a charm: the whole process finished within a second. With a larger number of files, however, it took much longer than I anticipated. When merging 1000 files of 99 kB each, reading and appending all the PDFs took 3 seconds in total, but writing the merged PDF took about 67 seconds. I also tried two levels of merging (500 files into one, the other 500 into another, then merging the final two), but it took around the same time. Is there any way to speed up the writing step?

I am adding my code below.

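    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    # dirs: the file names found in the folder (defined elsewhere)
    for pdf in dirs:
        if pdf.endswith('pdf'):
            merger.append(pdf)

    # filename: the output path of the merged PDF (defined elsewhere)
    merger.write(filename)
    merger.close()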

My PyPDF2 version is 2.11.2. Each input file is 99 kB and has 1 page. The output file for 1000 x 99 kB inputs is 20.050 kB.

seneill
  • By default, doing it in two steps makes no difference, because the steps are done sequentially. Use a process pool and delegate each step to a separate process (interpreter / core, let's not go into details), then merge the results. – Gameplay Dec 06 '22 at 13:07
  • There are command-line PDF tools that can do this job, no programming required. Both `pdftk` and Coherent PDF (`cpdf`) are great PDF tools. – Tim Roberts Dec 21 '22 at 07:43
  • I have noticed a massive speed difference between a network share and a local file. Have you tested that? – Ton Plomp Mar 03 '23 at 18:30
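
The process-pool suggestion above could be sketched roughly as follows. This is a minimal sketch, not a tested fix: `chunked`, `merge_chunk`, `parallel_merge`, the chunk size, and the worker count are all hypothetical names and values chosen for illustration. Note that the final step still writes every page from a single process, so it may not shorten the write time reported in the question.

```python
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_chunk(job):
    """Merge one group of PDFs into a single file and return its path."""
    # Imported here so each worker process loads PyPDF2 itself.
    from PyPDF2 import PdfMerger

    paths, out_path = job
    merger = PdfMerger()
    for path in paths:
        merger.append(path)
    merger.write(out_path)
    merger.close()
    return out_path


def parallel_merge(paths, final_path, chunk_size=250, workers=4):
    """Merge chunks in separate processes, then merge the partial files."""
    # Intermediate files go to a temporary directory (not cleaned up here).
    tmp_dir = Path(tempfile.mkdtemp(prefix="pdf-parts-"))
    jobs = [
        (chunk, str(tmp_dir / f"part-{n:04d}.pdf"))
        for n, chunk in enumerate(chunked(paths, chunk_size))
    ]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so page order is preserved.
        parts = list(pool.map(merge_chunk, jobs))
    # The final merge runs in the current process.
    merge_chunk((parts, final_path))
    return final_path
```

On platforms that spawn worker processes (Windows, macOS), `parallel_merge(...)` should be called from under an `if __name__ == "__main__":` guard.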

1 Answer

This is more a long comment than an answer.

I just tried this with the latest version of PyPDF2:

from PyPDF2 import PdfReader, PdfWriter
import time

reader = PdfReader("a-two-page-doc.pdf")
writer = PdfWriter()

# Append the same two-page document 1000 times (2000 pages in total)
for i in range(1000):
    writer.append(reader)

# Time only the write step
t0 = time.time()
with open("out-2000-pages.pdf", "wb") as fp:
    writer.write(fp)
t1 = time.time()

print(f"{t1-t0:.2f}s")

That took about 0.67s on my machine.

Which version of PyPDF2 did you use? Which version of Python? Is there maybe something about the specific PDF? How big is the single PDF? Did you enable some compression features?

Without a lot more details, nobody will be able to help you.
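
To help gather those details, here is a rough sketch of how the phases could be timed separately. It writes to an in-memory buffer first, so that serialization time can be told apart from storage time (for example a slow network share). `timed` and `profile_merge` are hypothetical helpers written for this illustration, not part of PyPDF2.

```python
import io
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def profile_merge(paths, out_path):
    """Merge `paths` and return per-phase timings in seconds."""
    # Imported here so the timing helper above works without PyPDF2.
    from PyPDF2 import PdfMerger

    timings = {}
    merger = PdfMerger()

    with timed("append", timings):
        for path in paths:
            merger.append(path)

    # Serialize into memory: no disk or network involved.
    buffer = io.BytesIO()
    with timed("serialize", timings):
        merger.write(buffer)

    # Copy the finished bytes to the target location.
    with timed("disk", timings):
        with open(out_path, "wb") as fp:
            fp.write(buffer.getvalue())

    merger.close()
    return timings
```

If "serialize" dominates, the time is spent inside the library; if "disk" dominates, the storage location is the more likely culprit.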

Martin Thoma
  • Hi, thank you for your answer. I have added some more details. I haven't enabled any compression features specifically; however, I noticed you are using PdfWriter instead of PdfMerger. What were your input and output file sizes? Maybe PdfMerger enables some kind of compression. – seneill Dec 27 '22 at 15:04
  • Did you try my code / replace the writer by merger? What were the results? – Martin Thoma Dec 27 '22 at 16:10