Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

A Pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.
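
As an illustration of the splitting and merging items above, here is a minimal sketch, assuming pypdf 3.x or later is installed. The helper names (`chunk_ranges`, `split_pdf`, `merge_pdfs`) and the output file naming are hypothetical and chosen only for this example; `PdfReader`, `PdfWriter`, `add_page` and `write` are the pypdf calls it relies on.

```python
from pathlib import Path


def chunk_ranges(num_pages, chunk_size):
    """Yield (start, stop) page-index pairs that cover num_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, num_pages, chunk_size):
        yield start, min(start + chunk_size, num_pages)


def split_pdf(src, out_dir, chunk_size=1):
    """Write each chunk of pages from src to its own PDF; return the paths."""
    # Imported lazily so the pure helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: 1-based page numbers in the file name.
        target = out_dir / f"pages_{start + 1}-{stop}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


def merge_pdfs(sources, target):
    """Append every page of every source PDF, in order, into target."""
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for src in sources:
        for page in PdfReader(src).pages:
            writer.add_page(page)
    with open(target, "wb") as handle:
        writer.write(handle)
```

With these helpers, `split_pdf("input.pdf", "out")` would write one file per page, and `merge_pdfs(paths, "merged.pdf")` would join them again; both file names are placeholders.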

Because it is pure Python, it should run on any Python platform without any dependencies on external libraries. It can also work entirely on in-memory stream objects (io.BytesIO in Python 3; StringIO in the original Python 2 pyPdf) rather than files on disk, allowing for PDF manipulation in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.
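
A minimal sketch of in-memory manipulation, assuming pypdf 3.x or later on Python 3, where the in-memory stream would be `io.BytesIO`. The helper names `looks_like_pdf` and `first_pages_in_memory` are hypothetical; a web handler could call them on uploaded bytes without writing anything to disk.

```python
import io


def looks_like_pdf(data):
    """Cheap sanity check: PDF files begin with the %PDF- header."""
    return data.startswith(b"%PDF-")


def first_pages_in_memory(pdf_bytes, count=1):
    """Return a new PDF (as bytes) holding only the first `count` pages."""
    # Imported lazily; requires pypdf to be installed.
    from pypdf import PdfReader, PdfWriter

    if not looks_like_pdf(pdf_bytes):
        raise ValueError("input does not start with a PDF header")
    reader = PdfReader(io.BytesIO(pdf_bytes))  # read from memory, not disk
    writer = PdfWriter()
    for page in reader.pages[:count]:
        writer.add_page(page)
    buffer = io.BytesIO()
    writer.write(buffer)  # write to memory, not disk
    return buffer.getvalue()
```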

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received a lot of updates in 2022, but it has since been deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html
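
Since the package name is the main difference between PyPDF2 3.x and pypdf, code that has to run in both environments can choose the module at import time. The following is only a sketch under that assumption; `load_pdf_module` is a hypothetical helper, not part of either library.

```python
import importlib
import importlib.util


def load_pdf_module(candidates=("pypdf", "PyPDF2")):
    """Return the first importable module among candidates, or None."""
    for name in candidates:
        # find_spec returns None when no top-level package has that name.
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    return None


pdf = load_pdf_module()
if pdf is not None:
    # pypdf and PyPDF2 3.x expose the same class names, e.g. PdfReader.
    PdfReader = pdf.PdfReader
    PdfWriter = pdf.PdfWriter
```

For new code, installing pypdf and importing from it directly is the simpler choice, because PyPDF2 is deprecated.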

1451 questions
3
votes
5 answers

Merging PDFs with Python pypdf and deleting merged files

I'm trying to write a program in Python that takes a PDF file and first appends to it any PDF whose name includes a fruit (Mango, Orange or Apple), then appends the PDFs with the names of animals to the original file (Zebra, Monkey, Dog)…
user2617248
3
votes
4 answers

How to install a module for python 2.6 on CentOS?

After I install python 2.6 on CentOS by: wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm sudo rpm -ivh epel-release-5-4.noarch.rpm yum install python26 Then I install pyPdf by: yum install pyPdf However, the…
ohho
3
votes
2 answers

merging PDFs in pyPDF on larger canvas

What I am looking to do in pyPDF is create a script that will generate a 17x11 PDF "canvas", add the 1st PDF to the left side, and the 2nd PDF to the right side. My initial question is: What is the method to generate an output PDF that does not…
jumbopap
3
votes
1 answer

pypdf errors - module object has no attribute number

here's the code I am using import os import decimal from pyPdf import PdfFileReader path = r"E:\python\Real Python\Real Python\Course materials\Chapter 8\Practice files" inputFileName = os.path.join(path,"Pride and Prejudice.pdf") inputFile =…
faraz
3
votes
1 answer

appending %%EOF to a PDF file in python

I'm trying to open a PDF with pyPdf. I get the following error: pyPdf.utils.PdfReadError: EOF marker not found I thought that I should add the EOF myself. However, I don't want to write bytes. Isn't it OS specific? I want to call something like…
anonymous
2
votes
1 answer

pyPdf output files are the same size regardless of page count

I'm trying to use pyPdf to extract a few pages from a large pdf to a separate file. Whenever I do, the resulting filesize is nearly identical to the source file. I think it has something to do with the bookmarks inside the files, because it the…
user2682863
2
votes
1 answer

How to merge pages of a pdf file into a single vertically combined page with python

I have tried the merge_page method in pypdf and pdfrw, but they stack one page over the other. How do I proceed? Below is the code I tried, similar with both modules: from pdfrw import PdfReader, PdfWriter, PageMerge def…
V Falcon
2
votes
1 answer

Why does pypdf stuff text with extra spaces when extracting text?

pypdf==3.11.0, like previous versions, returns text strings with the occasional inserted single space. But Windows Search and the "Find" in Adobe reader find the text unadulterated, and if you try finding the text string with the extra spaces…
PMSK
2
votes
3 answers

Issue with loading online pdf in python notebook using langchain PyPDFLoader

I am trying to load with python langchain library an online pdf from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf This is the code that I'm running locally: loader =…
2
votes
1 answer

How to properly attach a file to a PDF using PyPDF2?

I'm trying to attach a file to a PDF file, but I'm running into some issues. I'm not sure if I'm doing something wrong or if there's a bug in PyPDF2. I'm using Python 3.10.2 for this and I downloaded the newest package for PyPDF2 through pip. These…
bblizzard
2
votes
0 answers

Textract - windows10 - shell error - failed with exit code 127

The below code works fine for txt file but doesn't work with pdf files. import textract text = textract.process(r'C:\Users\Python_files\accounts.txt') However, I cannot seem to figure out what the problem is in the below code snippet: import…
2
votes
1 answer

Extract text based on annots from PDF using Python and PyPDF2

I am trying to read the below PDF programmatically using Python to extract useful information. Here, the "attachments" are basically links that point to specific pages inside the same PDF. I came to know that these are called "annots" and there is…
saran3h
2
votes
0 answers

My Python code returns an error because of the PyPDF2.PdfFileWriter function of the PyPDF2 library

I'm having an issue with this line: pdf_writer = PyPDF2.PdfFileWriter(strict=False) It returns this: Multiple definitions in dictionary at byte 0x1fd19 for key /Info Multiple definitions in dictionary at byte 0x1fd25 for key /Info Multiple…
VullWen
2
votes
1 answer

ImportError: cannot import name 'PdfReader' from 'PyPDF2'

I installed the PyPDF2 package using pip and got the following message after the installation: !pip install PyPDF2 Collecting PyPDF2 Downloading PyPDF2-2.11.1-py3-none-any.whl (220 kB) -------------------------------------- 220.4/220.4 kB…
2
votes
1 answer

How to solve (cid:x) pdfplumber python text extraction

I've been working with the pdfplumber library to extract text from PDF documents and it's been fine. However, in the documents I'm working on now, I just get spaces and lots of (cid:x) instead of text. Any solution? Thanks with…
foliveir