Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

A Pure-Python library built as a PDF toolkit. It is capable of:

extracting document information (title, author, ...),
splitting documents page by page,
merging documents page by page,
cropping pages,
merging multiple pages into a single page,
encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3) rather than file streams, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received a lot of updates in 2022, but was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions

1 answer

Watermark two pdfs - Each page of the first with each page of the second

I have two pdf files of the same length, let's say pdf1.pdf and pdf2.pdf. I'm trying to watermark each page of pdf1.pdf with pdf2.pdf (i.e., page 1 of pdf1.pdf with page 1 of pdf2.pdf, page 2 of pdf1.pdf with page 2 of pdf2.pdf ...). However, I'm…

asked Apr 28 '18 at 11:21

Fabian

0 answers

Extract text from pdf ignoring cropped content

I'm trying to extract text from a pdf file that has been cropped, i.e., it has a defined cropbox which only displays a portion of the page. The problem is that the cropped part still exists in the pdf file; it's just not visible. I've tried PyPDF2,…

python pdf pdfbox pypdf pdfminer

asked Mar 13 '18 at 00:34

doddy

1 answer

Issue with PyPDF2 and decoding pdf file from S3

I am trying to get a pdf file stored in one of my S3 buckets in AWS, and get some of its metadata like number of pages, and file size. I successfully get the pdf file from the S3 bucket, getting this when calling…

python amazon-web-services pdf amazon-s3 pypdf

asked Jan 22 '18 at 02:37

TJB

2 answers

How can I extract the TOC with PyPDF2?

Take this pdf as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf: …

pdf pypdf

asked Jan 08 '18 at 19:53

Martin Thoma

1 answer

Adding blank page to odd-paged PDF in Python

This is a rewrite of How to insert a "missing" page as blank page in PDF with Python? but I am trying to do it with the PdfFileWriter additional methods: cloneDocumentFromReader() and addBlankPage(), because it seemed cleaner this way. I need to add…

python pypdf

asked Oct 04 '17 at 15:53

Gnudiff

0 answers

Manually adding a new page to merged PDF offsets original bookmark destinations

The code below works great on what I'm trying to initially accomplish. However, if I try to manually add a new first page and bookmark to that PDF the existing bookmark destinations move back one page and are not linked to where they originally…

python pypdf

asked Apr 06 '17 at 13:02

theurlin

1 answer

Finding text whether it is highlighted or not

I am currently trying to use PyPDF2 to read a PDF file in Python. I want to know whether the text of the PDF file is highlighted or not. Context: we highlight text in the PDF file with a different color. Is there any way to know which text…

python pdf-generation pypdf

asked Aug 09 '16 at 10:10

ankyAS

3 answers

what causes "insufficient data for image" in a pdf

I have a program in Python (using pyPDF) that merges a bunch of different PDF documents. Sometimes, the resulting pdf is fine, except for some blank pages in the middle. When I view these documents with Acrobat Reader, I get an error message…

python pdf-generation pypdf

asked Oct 04 '10 at 18:03

Chris Curvey

2 answers

PyPDF2 not printing any output from the text

I am trying to print text from pdf using PyPDF2. Here is my code: import PyPDF2 pdf_file = open('report.pdf', 'rb') read_pdf = PyPDF2.PdfFileReader(pdf_file) number_of_pages = read_pdf.getNumPages() page = read_pdf.getPage(1) page_content =…

python-3.x pypdf

asked Jul 25 '16 at 12:47

muazfaiz

1 answer

Image drawn to reportlab pdf bigger than pdf paper size

I'm writing a program which takes all the pictures in a given folder and aggregates them into a pdf. The problem I have is that when the images are drawn, they are bigger in size and are rotated to the left oddly. I've searched everywhere, haven't…

python pdf reportlab pypdf

asked Mar 09 '16 at 15:38

opeonikute

4 answers

How do I apply my python code to all of the files in a folder at once, and how do I create a new name for each subsequent output file?

The code I am working with takes in a .pdf file, and outputs a .txt file. My question is, how do I create a loop (probably a for loop) which runs the code over and over again on all files in a folder which end in ".pdf"? Furthermore, how do I change…

python parsing for-loop naming pypdf

asked Jul 21 '15 at 17:39

Jack Bunce

1 answer

How to convert Pdf to Text with Unicode (utf-8) format using PyPdf

How can I convert Pdf to Text file in Unicode (utf-8) format using PyPdf in Python? # finally, write "output" to document-output.txt outputStream = file("document-output.txt", "wb") output.write(outputStream) outputStream.close()

python pdf unicode utf-8 pypdf

asked Jan 26 '15 at 02:12

Htet

1 answer

Porting to Python3: PyPDF2 mergePage() gives TypeError

I'm using Python 3.4.2 and PyPDF2 1.24 (also using reportlab 3.1.44 in case that helps) on Windows 7. I recently upgraded from Python 2.7 to 3.4, and am in the process of porting my code. This code is used to create a blank pdf page with links…

python-3.4 porting reportlab pypdf

asked Jan 15 '15 at 22:45

H0L0GH05t

2 answers

Working with a pdf from the web directly in Python?

I'm trying to use Python to read .pdf files from the web directly rather than save them all to my computer. All I need is the text from the .pdf and I'm going to be reading a lot (~60k) of them, so I'd prefer to not actually have to save them…

python pdf urllib pypdf

asked Apr 18 '14 at 03:33

Luigi

1 answer

How to calculate bounding box using PyPDF2 in Python 3

This question relates to PyPDF2 used with Python 3. Ghostscript apparently is able to effectively calculate the bounding box of the content within a PDF page as follows: gs -dBATCH -dSAFER -dNOPAUSE -sDEVICE=bbox document1.pdf The result returned in…

python python-3.x pdf pypdf

asked Mar 04 '14 at 13:10

Duke Dougal
