6

I want to split a PDF file using PyPDF2.

All the examples on the net are too difficult, don't work, or always give the error "AttributeError: 'PdfFileWriter' object has no attribute 'stream'".

Can someone help with this? I need to separate one PDF with 3 pages into three different files.

I'm starting from this:

import PyPDF2

pdfFileObj = open(r"D:\BPO\act.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(pdfReader.getPage(0))

But I don't know what to do next.

EDIT#1

I tried a loop for the splitting and ran into a problem: PdfFileWriter produces 3 files, the first with one page, the second with two, and the third with three. Where is my mistake in the following code?

act_sub_pages_name = ['p01.pdf', 'p02.pdf', 'p03.pdf']
with open(r"D:\BPO\act.pdf", 'rb') as act_mls:
    reader = PdfFileReader(act_mls)
    writer = PdfFileWriter()
    if reader.numPages == 3:
        counter = 0
        for x in range(3):
            path = '\\'.join(['D:\\BPO\\act sub pages', act_sub_pages_name[counter]])
            counter += 1
            writer.addPage(reader.getPage(x))
            with open(path, 'wb') as outfile: writer.write(outfile)

Sorry for my bad English.

EDIT#2

My solution, based on Paul Rooney's answer:

import os
from PyPDF2 import PdfFileReader, PdfFileWriter

act_pdf_file = 'D:\\BPO\\act.pdf'
act_sub_pages_name = ['p01.pdf', 'p02.pdf', 'p03.pdf']

def pdf_splitter(index, src_file):
    with open(src_file, 'rb') as act_mls:
        reader = PdfFileReader(act_mls)
        # A new writer on every call, so each output file gets exactly one page.
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(index))
        out_file = os.path.join('D:\\BPO\\act sub pages', act_sub_pages_name[index])
        with open(out_file, 'wb') as out_pdf:
            writer.write(out_pdf)

for x in range(3):
    pdf_splitter(x, act_pdf_file)

With a function everything works properly, but it is a little more complicated.

Acamori

3 Answers

27

You can use the write method of the PdfFileWriter to write out to the file.

from PyPDF2 import PdfFileReader, PdfFileWriter

with open("input.pdf", 'rb') as infile:

    reader = PdfFileReader(infile)
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(0))

    with open('output.pdf', 'wb') as outfile:
        writer.write(outfile)

You may want to loop over the pages of the input file, creating a new writer object for each page and adding that single page to it. Then write each one out to an incrementing filename, or use some other scheme for deciding the output filename.
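The loop described above could look roughly like the sketch below. It assumes a PyPDF2 release older than 3.0 (the one that still has PdfFileReader and PdfFileWriter). The names split_pdf and output_name, and the act1.pdf, act2.pdf naming scheme, are made up for illustration and are not part of PyPDF2.

```python
import os


def output_name(prefix, index):
    # Hypothetical naming scheme: act1.pdf, act2.pdf, ... (1-based numbering).
    return "{}{}.pdf".format(prefix, index + 1)


def split_pdf(src_path, out_dir, prefix="act"):
    # Imported here so output_name() can be used without PyPDF2 installed.
    # These class names assume PyPDF2 older than 3.0.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    with open(src_path, "rb") as infile:
        reader = PdfFileReader(infile)
        for index in range(reader.getNumPages()):
            # A fresh writer per page, so earlier pages do not accumulate.
            writer = PdfFileWriter()
            writer.addPage(reader.getPage(index))
            out_path = os.path.join(out_dir, output_name(prefix, index))
            with open(out_path, "wb") as outfile:
                writer.write(outfile)
            written.append(out_path)
    return written
```

Calling split_pdf(r"D:\BPO\act.pdf", r"D:\BPO\act sub pages") should then produce one file per page in that folder. The important detail is that the writer is created inside the loop: reusing a single writer is what makes every later file contain all the earlier pages as well.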

Paul Rooney
  • Yes, after splitting I need a new name for each file, such as act1, act2, act3. – Acamori Jul 17 '17 at 13:28
  • I am getting this error in the write call. PdfReadError: Unable to find 'endstream' marker after stream at byte 0x16f35. – Rishabh Gupta Sep 19 '22 at 16:45
  • Could be a fault in your file. Did you try both libraries suggested here? Also, can popular PDF viewers open your file? Look here: https://github.com/py-pdf/PyPDF2/issues/301. Hope it helps. – Paul Rooney Sep 20 '22 at 09:32
1

I've used a tool called xpdf for just this sort of task and it works really well. You can download it here.

It's a command line utility that you can call from Python. Make sure it's added to your path so you can call it from the command line.

Here's how you can interface with it from Python, using subprocess:

import subprocess

# The trailing "-" tells pdftotext to write the text to stdout instead of a .txt file.
text, _ = subprocess.Popen('pdftotext -fixed 0 -clip D:\\BPO\\act.pdf -',
                           shell=True,
                           stdout=subprocess.PIPE).communicate()

pages = text.decode('latin-1').split('\f')

Pages are separated by formfeed characters, so you'll get a list of pages.
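As a small illustration of the form feed split, here is a sketch of a helper that turns the decoded text into a list of page strings. The name split_pages is made up for this example. It assumes the text may end with a form feed after the last page, in which case the empty trailing piece is dropped.

```python
def split_pages(text):
    # Each form feed character ("\f") marks the end of a page.
    pages = text.split("\f")
    # If the text ends with a form feed, the last piece is empty; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

With the snippet above, split_pages(text.decode('latin-1')) would give one string per page. Note that this yields the text of each page, not separate PDF files.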

cs95
1

Update 2023:

@Paul Rooney's answer did not work for me as written, because newer PyPDF2 versions renamed the classes and methods. Below is the updated code:

from PyPDF2 import PdfReader, PdfWriter

with open("input.pdf", 'rb') as infile:
    reader = PdfReader(infile)
    writer = PdfWriter()
    total_pages = len(reader.pages)
    for page in range(total_pages):
        writer.add_page(reader.pages[page])
        # Write a file after the first third of the pages and again after the last page.
        if page == total_pages // 3 or page == total_pages - 1:
            with open("output-{}.pdf".format(page), 'wb') as outfile:
                writer.write(outfile)
            # Start a new writer so the next file does not repeat earlier pages.
            writer = PdfWriter()
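For the original question (one output file per page), the same renamed API can be wrapped in a more general sketch that puts a fixed number of pages into each file. It assumes PyPDF2 3.0 or later. The names page_chunks and split_every, and the output-{}.pdf pattern, are chosen here for illustration only.

```python
def page_chunks(total_pages, pages_per_file=1):
    # Group page indices into consecutive runs of at most pages_per_file.
    return [
        list(range(start, min(start + pages_per_file, total_pages)))
        for start in range(0, total_pages, pages_per_file)
    ]


def split_every(src_path, pages_per_file=1, out_pattern="output-{}.pdf"):
    # Imported here so page_chunks() can be used without PyPDF2 installed.
    # These class names assume PyPDF2 3.0 or later.
    from PyPDF2 import PdfReader, PdfWriter

    written = []
    with open(src_path, "rb") as infile:
        reader = PdfReader(infile)
        chunks = page_chunks(len(reader.pages), pages_per_file)
        for number, chunk in enumerate(chunks, start=1):
            # A new writer per output file keeps the files independent.
            writer = PdfWriter()
            for index in chunk:
                writer.add_page(reader.pages[index])
            out_path = out_pattern.format(number)
            with open(out_path, "wb") as outfile:
                writer.write(outfile)
            written.append(out_path)
    return written
```

For a 3-page input, split_every("input.pdf") should write output-1.pdf, output-2.pdf and output-3.pdf with one page each, while pages_per_file=2 would give a two-page file followed by a one-page file.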

        
        
user3503711