Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

It is a pure-Python library built as a PDF toolkit, capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without any dependencies on external libraries. It can also work entirely on in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3) rather than file streams, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received a lot of updates in 2022, but it was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.
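
Since only the package name differs between those two releases, code that has to run in environments with either package installed could detect which one is available. This is a sketch using only the standard library; `first_installed` and `load_pdf_module` are hypothetical helper names, not part of pypdf or PyPDF2.

```python
from importlib import import_module
from importlib.util import find_spec


def first_installed(candidates=("pypdf", "PyPDF2")):
    """Return the name of the first importable package, or None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


def load_pdf_module():
    """Import pypdf if present, otherwise fall back to the deprecated PyPDF2."""
    name = first_installed()
    if name is None:
        raise ImportError("neither pypdf nor PyPDF2 is installed")
    return import_module(name)
```

For new code, depending on pypdf directly is the simpler choice; a fallback like this is only worth it while a migration is in progress.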

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions
0
votes
1 answer

Python pyPdf issue downloading pdf

I'm having a hard time reading a PDF from the internet into the Python PdfFileReader object. My code works for the first URL, but it doesn't for the second and I don't know how to fix it. I can see that in the first example, the URL refers to a…
Bosiwow
0
votes
0 answers

PDF to txt conversion: why does txt.write() not work?

What I am trying to do here is convert a pdf to a text file. This txt is not a pre-existing one, but it is created with creaty. The problem is that although writy.write() has worked fine in other scripts, it won't do anything to change the writy…
0
votes
1 answer

PyPDF2 merging issue from file list

I'm getting some weird output files from trying to merge a couple of PDF files using pandas and PyPDF2. I have a single page PDF (certificate) I need to merge with a two page document which is common to all. Then name the resulting output file for…
James
0
votes
1 answer

Unhashable type: 'Indirect Object' in PyPDF2?

I've used PyPDF2 successfully with other PDFs without a hitch, but when trying to work with this current one I can't retrieve the pages without getting this error. Specifically, this is on the mergePage method. It must be something specific with the…
0
votes
2 answers

Replace a Specific page in a PDF with a page from another PDF in python 3

I am using PyPDF2 to highlight text on a particular page in PDF files. Hence, I get only a single page with highlighted text as an output. Now, I want to replace this page in the original PDF file. I have also tried the "search=" parameter from Adobe to…
0
votes
1 answer

Image extraction not working if PDF contains 8-10 or more pages

I am trying to extract images from a PDF and got some code from Stack Overflow. It works fine for some PDFs but not for all. I saw a pattern: for PDFs with more than 8-10 pages, it does not extract anything. I think I am missing…
0
votes
2 answers

Count Images in a pdf document through python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document through Python?
Hayat
0
votes
0 answers

Extracting transactional data from PDF row wise by using PyPDF2

I am trying to extract the transactional data from a PDF with a simple program using Python 3. What I am seeing is that the output comes back as garbage text from page 1. This happens with one specific bank statement PDF, whereas other PDFs work…
Sujit
0
votes
0 answers

How to parse text extracted from PDF file with delimiter using Python?

I have tried PyPDF2 to extract and parse text from a PDF using the following code segment: import PyPDF2 import re pdfFileObj = open('test.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) rawText = pdfReader.getPage().extractText() extractedText…
0
votes
1 answer

Appending PDF files based on multiple values in a dictionary key (or CSV) results in too many pages

I am trying to generate PDF files based on the county they fall in. If there is more than one PDF file per county, then I need to append the files into a single file based on the county key. I can't seem to get the maps to append based on key. The final…
NewbieX
0
votes
1 answer

Pdf Imposition using Python

I am trying to have the first page and second page of the PDF imposed onto page 1. The first page will be above the second page, imposed on the first page. The issue is that the pages are not trimming or merging. The last page imposes on the second to…
0
votes
1 answer

can only concatenate list (not "unicode") to list

I have copy-pasted some Lorem Ipsum into a Word .docx file, saved it as PDF and tried to run the following script for testing purposes to extract text from a PDF. from pyPdf import PdfFileReader if (fileExtension == ".PDF"): pdfDoc =…
PRIME
0
votes
1 answer

Python Script to Iterate through PDFs in a directory and find a matching line

Currently I get all my reports delivered to me via email, attached as PDFs. What I have done is set Outlook to automatically download those files to a certain directory every day. Sometimes those PDFs don't have any data in them and only contain the…
user3487244
0
votes
1 answer

pyPdf Splitting Large PDF fails after splitting 150-152 pages of the PDF

I have a function that takes in PDF file path as input and splits it into separate pages as shown below: import os,time from pyPdf import PdfFileReader, PdfFileWriter def split_pages(file_path): print("Splitting the PDF") temp_path =…
Vishnu Y S
0
votes
1 answer

Reading pdf remotely using urllib2

I am trying to extract text from pdf remotely. The url is this http://loc.gov/aba/publications/FreeLCC/A-text.pdf My code is as follows import urllib2 import PyPDF2 import io URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf' remote_file…
Echchama Nayak