Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs and merge entire files.

pypdf is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory streams (StringIO in Python 2, io.BytesIO in Python 3) rather than files on disk, allowing PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.
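As a minimal sketch of that in-memory workflow, the snippet below concatenates PDFs held as bytes without touching the disk. It assumes a pypdf 3.x or later release on Python 3, where the in-memory stream is io.BytesIO. The helper names merge_pdfs_in_memory and make_blank_pdf are hypothetical and exist only for this illustration.

```python
import io


def merge_pdfs_in_memory(pdf_blobs):
    """Concatenate PDFs given as bytes objects; return the merged PDF as bytes."""
    # pypdf is a third-party package (pip install pypdf). Importing it lazily
    # keeps this module importable on systems where it is not installed.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for blob in pdf_blobs:
        # Wrap each blob in a BytesIO so PdfReader sees a file-like object.
        reader = PdfReader(io.BytesIO(blob))
        for page in reader.pages:
            writer.add_page(page)

    # Write the result to another in-memory buffer instead of a file.
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


def make_blank_pdf(width=200, height=200):
    """Build a one-page blank PDF in memory (hypothetical helper for the demo)."""
    from pypdf import PdfWriter

    writer = PdfWriter()
    writer.add_blank_page(width=width, height=height)
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()
```

In a web application, the bytes returned by merge_pdfs_in_memory could be sent back directly as the response body.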

pypdf was inactive from 2010 to 2022. Maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022 but has since been deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name changed, to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html
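Since only the package name changed, code that has to run against either distribution can try the new name first and fall back to the old one. This is a sketch rather than an official migration recipe: load_pdf_reader_class is a hypothetical helper, and it assumes at least one of the two packages is installed.

```python
def load_pdf_reader_class():
    """Return the PdfReader class from pypdf, falling back to PyPDF2."""
    try:
        # Current package name.
        from pypdf import PdfReader
    except ImportError:
        # Deprecated package name; PyPDF2 3.0.0 exposes the same class.
        from PyPDF2 import PdfReader
    return PdfReader
```

New code should depend on pypdf directly; the fallback is only useful while older environments are being migrated.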

1451 questions
8 votes, 2 answers

Concatenating PDF files in memory with PyPDF2

I wish to concatenate (append) a bunch of small pdfs together effectively in memory in pure python. Specifically, a usual case is 500 single page pdfs, each with a size of about 400 kB, to be merged into one. Let's say the pdfs are available as a…
Andreas
8 votes, 1 answer

Check if page is vertical using PyPDF2?

Is there a way to check whether a PDF page is vertical using PyPDF2? Ideally, there would be a method pdfReader.getPage(0).isVertical() that returns true or false, but I can't find anything in the PageObject docs. I am attempting to merge a…
Henry
8 votes, 3 answers

How can I rotate a page with PyPDF2?

I'm editing a PDF file with pyPDF2. I managed to generate the PDF I want but I've yet to rotate some pages. I went to the documentation and found two methods: rotateClockwise and rotateCounterClockwise, and while they say the parameter is an int, I…
8 votes, 4 answers

PyPDF2 PdfFileWriter has no attribute stream

I am trying to split a pdf into its pages and save each page as a new pdf. I have tried this method from a previous question with no success and the pypdf2 split example from here with no success. EDIT: I can see in my files that it does…
pope
8 votes, 2 answers

How to check / uncheck checkboxes in a PDF with Python (preferably PyPDF2)?

I have the code below from PyPDF2 import PdfFileReader, PdfFileWriter d = { "Name": "James", " Date": "1/1/2016", "City": "Wilmo", "County": "United States" } reader = PdfFileReader("medicareRRF.pdf") inFields =…
8 votes, 6 answers

split a pdf based on outline

I would like to use pyPdf to split a pdf file based on the outline, where each destination in the outline refers to a different page within the pdf. Example outline: main --> points to page 1 sect1 --> points to page 1 sect2 -->…
darrell
8 votes, 3 answers

pyPdf ignores newlines in PDF file

I'm trying to extract each page of a PDF as a string: import pyPdf pages = [] pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb')) for i in range(0, pdf.getNumPages()): this_page = pdf.getPage(i).extractText() + "\n" this_page = "…
Joe Mornin
7 votes, 2 answers

How do I extract all of the text from a PDF using indexing

I am new to Python and coding in general. I'm trying to create a program that will OCR a directory of PDFs then extract the text so I can later pick out specific things. However, I am having trouble getting pdfPlumber to extract all the text from…
Ryan Adams
7 votes, 4 answers

Fast PDF splitter library

pyPdf is a great library to split and merge PDF files. I'm using it to split pdf documents into 1 page documents. pyPdf is pure python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the…
Nathan
7 votes, 1 answer

Edit text in PDF with python

I have a pdf file and I need to edit some text/values in the pdf. For example, in the pdf files that I have, BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting existing…
rootkit
7 votes, 3 answers

Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first

I am using Linux; printing raw to port 9100 returns a "bytes" type. I was wondering if it is possible to go from this straight into PyPDF2, rather than make a pdf file first and using method PdfFileReader? Thank you for your time.
TheSadPrinter
7 votes, 4 answers

Extracting text from pdf using Python and Pypdf2

I want to extract text from a pdf file using Python and the PYPDF package. This is my pdf file and this is my code: import PyPDF2 opened_pdf = PyPDF2.PdfFileReader('test.pdf', 'rb') p=opened_pdf.getPage(0) p_text= p.extractText() # extract data line by…
Amir
7 votes, 6 answers

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here: http://www.4shared.com/document/kmJF67E4/forms.html If you run the following, the first 81…
DrJAKing
7 votes, 2 answers

Watermark Removal on PDF with PyPDF2

# This Section imports the necessary classes from the PyPDF2 library from PyPDF2 import PdfFileReader, PdfFileWriter from PyPDF2.generic import ContentStream, NameObject, TextStringObject from PyPDF2.utils import b_ # The watermark says SAMPLE on…
Shane G.
7 votes, 4 answers

How to open a generated PDF file in browser?

I have written a Pdf merger which merges an original file with a watermark. What I want to do now is to open 'document-output.pdf' file in the browser by a Django view. I already checked Django's related articles, but since my approach is relatively…
israkir