Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

A pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory stream objects (StringIO on Python 2, io.BytesIO on Python 3) rather than on files, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022, but it was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions
7 votes
1 answer

PyPDF2 insists on removing all the spaces

I have read a number of other Stack Overflow answers and have yet to find a satisfactory answer to this, but it has been asked before. When I attempt to use PyPDF2 to read PDF documents, it merges all of the words in a sentence into one continuous…
Steve
  • 4,388
  • 3
  • 17
  • 25
7 votes
2 answers

PyPDF2 won't extract all text from PDF

I'm trying to extract text from a PDF (https://www.sec.gov/litigation/admin/2015/34-76574.pdf) using PyPDF2, and the only result I'm getting is the following string: b'' Here is my code: import PyPDF2 import urllib.request import io url =…
Al_C91
  • 71
  • 1
  • 1
  • 2
7 votes
1 answer

Identifying Bold Text in PDF using pyPdf

I am using pyPdf to extract text from a PDF. I would like to be able to know which text is bold in order to identify bold section headers. How can I identify bold text?
Michael
  • 13,244
  • 23
  • 67
  • 115
7 votes
3 answers

PyPDF2 won't import

Hi, I'm just getting started with Python and trying to get some requisite libraries installed. Using Python 3.4.1 on OS X. I have installed PyPDF2 (with supposed success), yet I cannot seem to use the tools: sh-3.2# port select --list python …
BrentL
  • 73
  • 1
  • 1
  • 5
7 votes
4 answers

PyPDF2 compression

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/ import PyPDF2 path = open('path/to/hello.pdf', 'rb') path2 =…
nagordon
  • 1,307
  • 2
  • 13
  • 16
7 votes
2 answers

Detect and alter strings in PDFs

I want to be able to detect a pattern in a PDF and somehow flag it. For instance, in this PDF, there's the string *2. I want to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like…
Joe Mornin
  • 8,766
  • 18
  • 57
  • 82
7 votes
5 answers

finding on which page a search string is located in a pdf document using python

Which Python packages can I use to find out on which page a specific “search string” is located? I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality and…
user1043144
  • 2,680
  • 5
  • 29
  • 45
6 votes
3 answers

PyPDF4 - Exported PDF file size too big

I have a PDF file of around 7000 pages and 479 MB. I have created a Python script using PyPDF4 to extract only specific pages if the pages contain specific words. The script works but the new PDF file, even though it has only 650 pages from the…
6 votes
2 answers

Detect and crop a box in .pdf or image as individual images

I have a multi-page .pdf (scanned images) containing handwriting I would like to crop and store as new separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I…
Steve
  • 135
  • 1
  • 10
6 votes
1 answer

Getting TypeError: ord() expected string of length 1, but int found error

The code is: from PyPDF2 import PdfFileReader with open('HTTP_Book.pdf','rb') as file: pdf=PdfFileReader(file) pagedd=pdf.getPage(0) print(pagedd.extractText()) This code raises the error shown below: TypeError: ord() expected string of…
Jeet Singh
  • 303
  • 1
  • 2
  • 10
6 votes
1 answer

Duplicating PDF with PyPDF2 gives blank pages

I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to…
benwiggy
  • 1,440
  • 17
  • 35
6 votes
3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…
Evan Mata
  • 500
  • 1
  • 6
  • 19
6 votes
1 answer

How to extract images and image BBox coordinates using python?

I am trying to extract images from a PDF along with the BBox coordinates of each image. I tried using the pdfrw library; it identifies image objects, and it has an attribute called media box which has some coordinates. I am not sure if those are correct bbox…
Satyaaditya
  • 537
  • 8
  • 26
6 votes
1 answer

PyPDF2 to extract vertical text from scanned pdf

I am trying to extract text from scanned PDFs using PyPDF2. Some of the PDFs contain text aligned vertically, but the orientation of the page is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…
Mms
  • 91
  • 4
6 votes
1 answer

Why does PyPDF2.PdfFileWriter forget changes I made to a document?

I am trying to modify text in a PDF file. The text can be in an object of type Tj or BDC. I find the correct objects and if I read them directly after changing them they show the updated values. But if I pass the complete page to PdfFileWriter the…
Joe
  • 6,758
  • 2
  • 26
  • 47