Python - Split pdf by pages

Question

I am using PyPDF2 to split a large PDF into single pages. The problem is that this process is very slow.

This is the code I use:

import os
from PyPDF2 import PdfFileWriter, PdfFileReader

with open(input_pdf_path, "rb") as input_file:
    input_pdf = PdfFileReader(input_file)
    directory = "%s/paging/" % os.path.dirname(input_pdf_path)
    if not os.path.exists(directory):
        os.makedirs(directory)

    page_files = []
    for i in range(0, input_pdf.numPages):
        output = PdfFileWriter()
        output.addPage(input_pdf.getPage(i))
        file_name = "%s/#*#*#*##-%s.pdf" % (directory, i)
        page_files.append(file_name)
        with open(file_name, "wb") as outputStream:
            output.write(outputStream)

With this code, it takes about 35 to 55 seconds to split a 177-page PDF. Is there a way I can improve this code? Is there another library that is more suitable for this job?

PyPdf2 is a pure-Python library, so… See alternative solutions here: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167 — Laurent LAPORTE, Oct 04 '16 at 21:21
Dany, just a remark: this file name pattern is curious "%s/#*#*#*##-%s.pdf" and not valid under Windows. — Laurent LAPORTE, Oct 05 '16 at 07:26
It's curious, because I split 2135 pages in 40 seconds… What's your OS, PyPDF2 version, and Python version? — Laurent LAPORTE, Oct 05 '16 at 07:31
I am using Ubuntu. I am using this pattern at the next step, and I am using regex so it has to be unique. — Montoya, Oct 05 '16 at 07:33
About the pattern, why not use something like: `file_name = os.path.join(directory, u"{page:04d}.pdf".format(page=i))`? — Laurent LAPORTE, Oct 05 '16 at 07:35
Just because there is less chance that a regex will find something like *#*#*#.. — Montoya, Oct 05 '16 at 07:37
What is the size of your PDF? Do you have images or mainly text? — Laurent LAPORTE, Oct 05 '16 at 07:37
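The naming discussion in the comments above could be settled with something like the following: a minimal sketch that keeps the file names distinctive enough for a regex while using only characters that are also valid on Windows. The prefix `pdfsplit-page-` and both helper names are hypothetical, chosen only for illustration.

```python
import os
import re

# Hypothetical prefix: any distinctive string made of characters that are
# valid in file names on Windows as well as on Linux.
PAGE_PREFIX = "pdfsplit-page-"

# Matches names such as "pdfsplit-page-0007.pdf" and captures the page number.
PAGE_FILE_RE = re.compile(r"^" + re.escape(PAGE_PREFIX) + r"(\d+)\.pdf$")


def page_file_name(directory, page_index):
    """Build the path of the file that holds one page (zero-padded index)."""
    return os.path.join(directory, "{}{:04d}.pdf".format(PAGE_PREFIX, page_index))


def page_index_from_name(path):
    """Return the page index encoded in a file name, or None if it does not match."""
    match = PAGE_FILE_RE.match(os.path.basename(path))
    return int(match.group(1)) if match else None
```

Zero-padding the index also keeps the files in page order when the directory is listed alphabetically.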

score 5 · Answer 1 · answered Oct 05 '16 at 08:15

Refactoring

I have refactored the code like this:

import os

import PyPDF2


def split_pdf_pages(input_pdf_path, target_dir, fname_fmt=u"{num_page:04d}.pdf"):
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    with open(input_pdf_path, "rb") as input_stream:
        input_pdf = PyPDF2.PdfFileReader(input_stream)

        if input_pdf.flattenedPages is None:
            # flatten the file using getNumPages()
            input_pdf.getNumPages()  # or call input_pdf._flatten()

        for num_page, page in enumerate(input_pdf.flattenedPages):
            output = PyPDF2.PdfFileWriter()
            output.addPage(page)

            file_name = os.path.join(target_dir, fname_fmt.format(num_page=num_page))
            with open(file_name, "wb") as output_stream:
                output.write(output_stream)

note: it's difficult to do better…

Profiling

With this split_pdf_pages function, you can do profiling:

import cProfile
import io
import os
import pstats

pdf_path = "path/to/file.pdf"
directory = os.path.join(os.path.dirname(pdf_path), "pages")

pr = cProfile.Profile()
pr.enable()
split_pdf_pages(pdf_path, directory)
pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
ps.print_stats()
print(s.getvalue())

Run the profiling with your own PDF file, and analyse the result…
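If you want to profile several variants, the same steps can be wrapped in a small helper. This is only a sketch: `profile_call` and its `top` parameter are names I made up, and it uses nothing beyond the standard `cProfile` and `pstats` modules.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report is sorted by cumulative time and limited to `top` entries.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop profiling, even if func raises.
        profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(top)  # an integer argument limits the number of lines printed
    return result, stream.getvalue()
```

It would be called as `_, report = profile_call(split_pdf_pages, pdf_path, directory)` followed by `print(report)`.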

Profiling result

The profiling gave me this result:

         159696614 function calls (155047949 primitive calls) in 57.818 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.899    0.899   57.818   57.818 $HOME/workspace/pypdf2_demo/src/pypdf2_demo/split_pdf_pages.py:14(split_pdf_pages)
     2136    0.501    -.---   53.851    0.025 $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:445(write)
103229/96616    1.113    -.---   36.924    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:544(writeToStream)
    27803    9.066    -.---   25.381    0.001 $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:445(writeToStream)
4185807/2136    5.054    -.---   14.635    0.007 $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:541(_sweepIndirectReferences)
50245/41562    0.117    -.---    9.028    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:1584(getObject)
 31421489    6.898    -.---    8.193    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/utils.py:231(b_)
    56779    2.070    -.---    7.882    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:142(writeToStream)
     8683    0.322    -.---    7.020    0.001 $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/pdf.py:1531(_getObjectFromStream)
459978/20068    1.098    -.---    6.490    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:54(readObject)
26517/19902    0.484    -.---    6.360    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:553(readFromStream)
    27803    3.893    -.---    5.565    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:1162(encode_pdfdocencoding)
 15735379    4.173    -.---    5.412    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/utils.py:268(chr_)
  3617738    2.105    -.---    4.956    -.--- $HOME/virtualenv/py3-pypdf2_demo/lib/site-packages/PyPDF2/generic.py:265(writeToStream)
 18882076    3.856    -.---    3.856    -.--- {method 'write' of '_io.BufferedWriter' objects}

It appears that:

The writeToStream function is called very often, but I don't know how to optimize this.
The write method writes directly to the stream rather than to memory => an optimisation is possible.

Improvement

Serialize the PDF page in a buffer (in memory), then write the buffer to the file:

buffer = io.BytesIO()
output.write(buffer)
with open(file_name, "wb") as output_stream:
    output_stream.write(buffer.getvalue())

I processed the 2135 pages in 35 seconds instead of 40.

Poor optimization indeed :-(
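For reference, the buffering step can be isolated in a helper so that it is easy to switch on and off when timing. This is a sketch under the assumption that the writer only needs a `write(stream)` method, as `PdfFileWriter` has; the function name `write_via_buffer` is mine.

```python
import io


def write_via_buffer(writer, file_name):
    """Serialize `writer` into memory, then write the bytes to disk in one call.

    `writer` is any object exposing write(stream), such as the
    PyPDF2.PdfFileWriter used above. Returns the number of bytes written.
    """
    buffer = io.BytesIO()
    writer.write(buffer)  # the many small writes go to memory
    data = buffer.getvalue()
    with open(file_name, "wb") as output_stream:
        output_stream.write(data)  # a single write to the file
    return len(data)
```

In `split_pdf_pages`, the last two lines of the loop would become `write_via_buffer(output, file_name)`.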

The library itself can be optimized by using `__slots__` in `PdfObject` (see: `PyPDF2\generic.py`). That gives a gain of about 10% in execution time (and maybe more in memory). — Laurent LAPORTE, Oct 05 '16 at 08:47
Thanks for this answer! I tried this, but it only made it a little bit faster. I ended up using pdftk, as I will explain in my answer. — Montoya, Oct 05 '16 at 11:11

score 3 · Accepted Answer · answered Oct 05 '16 at 11:19

No optimization really made a significant improvement. I ended up using pdftk. I came across this page, which explains very nicely how to split pages fast.

pdftk is a command-line tool (there is also a graphical version) with some very nice options.

Installation:

 sudo apt-get update
 sudo apt-get install pdftk

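Usage with Python 3:

 from subprocess import PIPE, Popen

 # PdfSplitter.FILE_FORMAT is the output file-name prefix (defined elsewhere in my code);
 # pdftk replaces %d with the page number.
 process = Popen(['pdftk',
                  input_pdf_path,
                  'burst',
                  'output',
                  PdfSplitter.FILE_FORMAT + '%d.pdf'],
                 stdout=PIPE,
                 stderr=PIPE)
 stdout, stderr = process.communicate()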

With this tool I managed to split the 177-page PDF within 2 seconds.
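To send the pages to a specific directory, the directory can be made part of the `output` pattern. The sketch below assumes, as I understand pdftk's documentation, that `burst` accepts a printf-style pattern containing a path; the function names and the `page_%04d.pdf` default are hypothetical.

```python
import os
import subprocess


def build_burst_command(input_pdf_path, target_dir, pattern="page_%04d.pdf"):
    """Return the pdftk argument list that bursts a PDF into target_dir."""
    return ["pdftk", input_pdf_path, "burst", "output", os.path.join(target_dir, pattern)]


def burst_pdf(input_pdf_path, target_dir, pattern="page_%04d.pdf"):
    """Split a PDF into one file per page; requires pdftk on the PATH."""
    os.makedirs(target_dir, exist_ok=True)
    command = build_burst_command(input_pdf_path, target_dir, pattern)
    # check=True raises CalledProcessError if pdftk exits with a non-zero status.
    subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
```

Using `subprocess.run` with `check=True` instead of a bare `Popen` means a failed split is reported as an exception rather than passing silently.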

How do I specify the output path? — Koustav Chanda, Sep 17 '20 at 12:18
