Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from them.

It is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3) rather than files on disk, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022, but it was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions
3
votes
1 answer

Watermark two pdfs - Each page of the first with each page of the second

I have two pdf files of the same length, let's say pdf1.pdf and pdf2.pdf. I'm trying to watermark each page of pdf1.pdf with pdf2.pdf (i.e., page 1 of pdf1.pdf with page 1 of pdf2.pdf, page 2 of pdf1.pdf with page 2 of pdf2.pdf ...). However, I'm…
Fabian
  • 51
  • 1
  • 10
3
votes
0 answers

Extract text from pdf ignoring cropped content

I'm trying to extract text from a pdf file that has been cropped, i.e., it has a defined cropbox which only displays a portion of the page. The problem is that the cropped part still exists in the pdf file; it's just not visible. I've tried PyPDF2,…
doddy
  • 579
  • 5
  • 18
3
votes
1 answer

Issue with PyPDF2 and decoding pdf file from S3

I am trying to get a pdf file stored in one of my S3 buckets in AWS, and get some of its metadata like number of pages, and file size. I successfully get the pdf file from the S3 bucket, getting this when calling…
TJB
  • 3,706
  • 9
  • 51
  • 102
3
votes
2 answers

How can I extract the TOC with PyPDF2?

Take this pdf as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
3
votes
1 answer

Adding blank page to odd-paged PDF in Python

This is a rewrite of How to insert a "missing" page as blank page in PDF with Python? but I am trying to do it with the PdfFileWriter additional methods: cloneDocumentFromReader() and addBlankPage(), because it seemed cleaner this way. I need to add…
Gnudiff
  • 4,297
  • 1
  • 24
  • 25
3
votes
0 answers

Manually adding a new page to merged PDF offsets original bookmark destinations

The code below works great on what I'm trying to initially accomplish. However, if I try to manually add a new first page and bookmark to that PDF the existing bookmark destinations move back one page and are not linked to where they originally…
theurlin
  • 181
  • 1
  • 1
  • 8
3
votes
1 answer

Finding text whether it is highlighted or not

I am currently trying to use PyPDF2 to read a PDF file in Python. I want to know whether the text of the PDF file is highlighted or not. Context: We highlight text in PDF files with a different color. Is there any way to know which text…
ankyAS
  • 301
  • 2
  • 11
3
votes
3 answers

what causes "insufficient data for image" in a pdf

I have a program in Python (using pyPDF) that merges a bunch of different PDF documents. Sometimes, the resulting pdf is fine, except for some blank pages in the middle. When I view these documents with Acrobat Reader, I get an error message…
Chris Curvey
  • 9,738
  • 10
  • 48
  • 70
3
votes
2 answers

PyPDF2 not printing any output from the text

I am trying to print text from pdf using PyPDF2. Here is my code: import PyPDF2 pdf_file = open('report.pdf', 'rb') read_pdf = PyPDF2.PdfFileReader(pdf_file) number_of_pages = read_pdf.getNumPages() page = read_pdf.getPage(1) page_content =…
muazfaiz
  • 4,611
  • 14
  • 50
  • 88
3
votes
1 answer

Image drawn to reportlab pdf bigger than pdf paper size

I'm writing a program which takes all the pictures in a given folder and aggregates them into a pdf. The problem I have is that when the images are drawn, they are bigger in size and are oddly rotated to the left. I've searched everywhere, haven't…
opeonikute
  • 494
  • 1
  • 4
  • 15
3
votes
4 answers

How do I apply my python code to all of the files in a folder at once, and how do I create a new name for each subsequent output file?

The code I am working with takes in a .pdf file, and outputs a .txt file. My question is, how do I create a loop (probably a for loop) which runs the code over and over again on all files in a folder which end in ".pdf"? Furthermore, how do I change…
Jack Bunce
  • 43
  • 1
  • 3
3
votes
1 answer

How to convert Pdf to Text with Unicode (utf-8) format using PyPdf

How can I convert a PDF to a text file in Unicode (UTF-8) format using PyPdf in Python? # finally, write "output" to document-output.pdf outputStream = file(("document-output.txt", "wb") output.write(outputStream) outputStream.close()
Htet
  • 159
  • 10
3
votes
1 answer

Porting to Python3: PyPDF2 mergePage() gives TypeError

I'm using Python 3.4.2 and PyPDF2 1.24 (also using reportlab 3.1.44 in case that helps) on Windows 7. I recently upgraded from Python 2.7 to 3.4, and am in the process of porting my code. This code is used to create a blank pdf page with links…
H0L0GH05t
  • 91
  • 1
  • 7
3
votes
2 answers

Working with a pdf from the web directly in Python?

I'm trying to use Python to read .pdf files from the web directly rather than save them all to my computer. All I need is the text from the .pdf and I'm going to be reading a lot (~60k) of them, so I'd prefer to not actually have to save them…
Luigi
  • 4,129
  • 6
  • 37
  • 57
3
votes
1 answer

How to calculate bounding box using PyPDF2 in Python 3

This question relates to PyPDF2 used with Python 3. Ghostscript is apparently able to effectively calculate the bounding box of the content within a PDF page as follows: gs -dBATCH -dSAFER -dNOPAUSE -sDEVICE=bbox document1.pdf The result returned in…
Duke Dougal
  • 24,359
  • 31
  • 91
  • 123