Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

pypdf is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory binary streams such as io.BytesIO objects (StringIO in the Python 2 era) rather than on files, allowing PDFs to be manipulated in memory. This makes it a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received a lot of updates in 2022, but it has since been deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.
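Since the rename mostly affects the import line, code that has to run in environments with either package could use a fallback import along these lines. This is only a sketch: the helper name `load_pdf_module` is hypothetical, and it is only safe when the calling code restricts itself to names that exist in both packages (for example `PdfReader` and `PdfWriter` in the versions mentioned above).

```python
import importlib


def load_pdf_module():
    """Hypothetical helper: return pypdf if importable, else PyPDF2, else None."""
    for name in ("pypdf", "PyPDF2"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


pdf_module = load_pdf_module()
if pdf_module is None:
    print("Neither pypdf nor PyPDF2 is installed")
else:
    # The version attribute is read defensively in case a release lacks it.
    version = getattr(pdf_module, "__version__", "unknown")
    print(f"Using {pdf_module.__name__} {version}")
```

For new projects, installing and importing pypdf directly is the simpler choice, since PyPDF2 is deprecated.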

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions
-1
votes
1 answer

IndexError: list index out of range in pypdf2 extract_text in specific pdf file

I have tried: from PyPDF2 import PdfReader input_pdf = PdfReader(open("pdfFile.pdf", "rb")) thispage = input_pdf.pages[0] print(thispage.extract_text()) And I got the following error: Traceback (most recent call last): File…
-1
votes
2 answers

Check if two sentences contain any matching word using Python

I'm trying to simply check whether two sentences have any similar words. Here's an example: string_one = "Author: James Oliver" string_two = "James Oliver has written this beautiful article which says...." In this case, these two sentences match…
saran3h
  • 12,353
  • 4
  • 42
  • 54
-1
votes
2 answers

pypdf gives output with incorrect PDF format

I am using the following code to resize pages in a PDF: from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize from pypdf.generic import RectangleObject reader = PdfReader("input.pdf") writer = PdfWriter() for page in…
Zain Khaishagi
  • 135
  • 1
  • 9
-1
votes
1 answer

Python PDFMerger Too Slow

I am using PDFMerger from PyPDF2. My program basically reads all PDFs in a folder and merges them into a single one. I ran a test with 15 PDF files of 99 KB each and it worked like a charm. The whole process finished within a second.…
seneill
  • 63
  • 7
-1
votes
1 answer

Split pdf based off value and Merge dictionaries inside list in python

I want to split a pdf based off a value on every page. Every value should be in its own pdf file. I currently have the following list where all values with the pages are displayed: l = [ {'abr': '123 ', 'page': 1}, {'abr': '125 ', 'page':…
-1
votes
2 answers

Having trouble getting all the page numbers from a pdf file to output

I'm having trouble getting all the page numbers from a pdf file. This is my code! I only get one page number as output. I'm trying to get all the page numbers from my pdf file. How would I fix my code to get all the pdf page numbers? In total…
George
  • 1
  • 1
-1
votes
1 answer

How can I merge multiple pdf-files to one?

from tkinter import filedialog as fd import tkinter as tk from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger import os mother = tk.Tk() base_pdf = fd.askopenfilename(filetypes=[('PDF files', '.pdf')], title='Wählen Sie bitte die…
POPZMOKE
  • 3
  • 2
-1
votes
2 answers

Getting none from fields while parsing a pdf file

I am trying to parse a pdf file. I want to get all the values in a list or dictionary of the checkbox values. But I am getting this error. "return OrderedDict((k, v.get('/V', '')) for k, v in fields.items()) AttributeError: 'NoneType' object has no…
saxope
  • 11
  • 3
-1
votes
1 answer

Extract Text from PDF using Python

Hi, I am a Python beginner. I am trying to extract text from only a few boxes in a PDF file (PDF File Link). I used the pytesseract library to extract the text, but it is extracting all the text. I want to limit my text extraction to certain observations in…
-1
votes
1 answer

PyPDF2 find coordinates of Objects

Is there any way I can find the coordinates of objects in a PDF using Python? I then want to cut the PDF exactly above the highest object and below the lowest object: from PyPDF2 import PdfFileWriter, PdfFileReader with open("in.pdf", "rb") as…
Jayklops
  • 1
  • 2
-1
votes
2 answers

PDF Parsing a sentence across multiple Lines

Goal: if pdf line contains sub-string, then copy entire sentence (across multiple lines). I am able to print() the line the phrase appears in. Now, once I find this line, I want to go back iterations, until I find a sentence terminator: . ! ?, from…
StressedBoi69420
  • 1,376
  • 1
  • 12
  • 40
-1
votes
1 answer

Loop through folder and subfolders and merge pdf

I tried to create a script to loop through a parent folder and its subfolders and merge all of the pdfs into one. Below is the code I wrote so far, but I don't know how to combine them into one script. Reference: Merge PDF files. The first function is to…
Brian C.
  • 3
  • 3
-1
votes
1 answer

I cannot find a way to extract underlined text, can it be done with pdfminer.six?

I am trying to extract underlined text from a PDF using Python, but I am not able to find a correct solution. Can anyone help with this, please?
-1
votes
1 answer

Split PDF into 10 page sets (python)

I need to split a roughly 380 page pdf file into sets of 10 pages using python. My initial thoughts are to use PyPDF2 but I have no experience with it. I do need a mechanism to ensure the final PDF is saved despite it being under 10 pages. (eg. 383…
Sam Oberly
  • 13
  • 2
-1
votes
1 answer

Module not found when I tried to import pyPDF2

My python version is 3.6. I am able to install the pyPDF2. Ran pip install pyPDF2 successfully. Ran pip list, it shows up as 1.26.0 My environment is not base, but I set up an environment as pytorch. pyPDF2 is installed successfully in this…
Meng Ge
  • 21
  • 1
  • 6