Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files.

A pure-Python library built as a PDF toolkit. It is capable of:

extracting document information (title, author, ...),
splitting documents page by page,
merging documents page by page,
cropping pages,
merging multiple pages into a single page,
encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory stream objects (StringIO in the original Python 2 library, io.BytesIO in Python 3) rather than file streams, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received a lot of updates in 2022, but it was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions

1 answer

PyPDF2 insists on removing all the spaces

I have read a number of other Stack Overflow answers and have yet to find a satisfactory answer to this, even though it has been asked before. When I attempt to use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous…

python pypdf

asked Apr 28 '16 at 12:11

Steve

2 answers

PyPDF2 won't extract all text from PDF

I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string: b'' Here is my code: import PyPDF2 import urllib.request import io url =…

python python-3.x pdf pypdf

asked Jan 29 '16 at 17:53

Al_C91

1 answer

Identifying Bold Text in PDF using pyPdf

I am using pyPdf to extract text from a PDF. I would like to be able to know which text is bold in order to identify bold section headers. How can I identify bold text?

python pypdf

asked Sep 04 '14 at 00:10

Michael

3 answers

PyPDF2 won't import

Hi I'm just getting started with python and trying to get some requisite libraries installed. Using Python 3.4.1 on OS X. I have installed PyPDF2 (with supposed success), yet I cannot seem to use the tools: sh-3.2# port select --list python …

python installation import pypdf

asked Aug 10 '14 at 00:02

BrentL

4 answers

PyPDF2 compression

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/: import PyPDF2 path = open('path/to/hello.pdf', 'rb') path2 =…

python pdf pypdf

asked Apr 01 '14 at 03:42

nagordon

2 answers

Detect and alter strings in PDFs

I want to be able to detect a pattern in a PDF and somehow flag it. For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like…

python regex perl pdf pypdf

asked Oct 16 '13 at 22:04

Joe Mornin

5 answers

finding on which page a search string is located in a pdf document using python

Which Python packages can I use to find out on which page a specific “search string” is located? I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality and…

python pdf pypdf

asked Sep 24 '12 at 19:50

user1043144

3 answers

PyPDF4 - Exported PDF file size too big

I have a PDF file of around 7000 pages and 479 MB. I have created a Python script using PyPDF4 to extract only specific pages if the pages contain specific words. The script works, but the new PDF file, even though it has only 650 pages from the…

python python-3.x pdf pypdf

asked Jan 06 '20 at 14:40

Mihail-Cosmin Munteanu

2 answers

Detect and crop a box in .pdf or image as individual images

I have a multi-page .pdf (scanned images) containing handwriting I would like to crop and store as new separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I…

python opencv image-processing computer-vision pypdf

asked Jul 17 '19 at 04:30

Steve

1 answer

Getting TypeError: ord() expected string of length 1, but int found error

The code is:

    from PyPDF2 import PdfFileReader
    with open('HTTP_Book.pdf', 'rb') as file:
        pdf = PdfFileReader(file)
        pagedd = pdf.getPage(0)
        print(pagedd.extractText())

This code raises the error shown below: TypeError: ord() expected string of…

python python-3.x pypdf

asked May 05 '19 at 16:02

Jeet Singh

1 answer

Duplicating PDF with PyPDF2 gives blank pages

I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to…

python pdf pypdf

asked Apr 21 '19 at 17:09

benwiggy

3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…

python pdf text-extraction pypdf pdfminer

asked Apr 09 '19 at 00:26

Evan Mata

1 answer

How to extract images and image BBox coordinates using python?

I am trying to extract images from a PDF along with the BBox coordinates of each image. I tried using the pdfrw library; it identifies image objects, and it has an attribute called media box which has some coordinates, but I am not sure if those are the correct bbox…

python pypdf pdf-extraction pdfrw

asked Feb 06 '19 at 06:41

Satyaaditya

1 answer

PyPDF2 to extract vertical text from scanned pdf

I am trying to extract text from a scanned PDF using PyPDF2. Some of the PDFs contain text aligned vertically, but the orientation of the page is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…

python python-3.x pypdf pdfminer pdf-extraction

asked Sep 27 '18 at 05:53

Mms

1 answer

Why does PyPDF2.PdfFileWriter forget changes I made to a document?

I am trying to modify text in a PDF file. The text can be in an object of type Tj or BDC. I find the correct objects and if I read them directly after changing them they show the updated values. But if I pass the complete page to PdfFileWriter the…

python python-3.x pdf pdf-generation pypdf

asked Sep 25 '18 at 13:25

Joe
