Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.

pypdf is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory binary streams (BytesIO objects in Python 3) rather than on files on disk, allowing PDFs to be manipulated in memory. This makes it a useful tool for websites that manage or manipulate PDFs.

pypdf was inactive from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022 but was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html
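
Since the rename from PyPDF2==3.0.0 to pypdf==3.1.0 described above changed only the package name, code written against PyPDF2 3.0.0 can in principle be ported by renaming the package in imports and qualified names. The sketch below is one hedged way to do that with a whole-word text substitution; `rename_pypdf2` is a hypothetical helper and not part of pypdf. It also rewrites matches inside strings and comments, and it does nothing for code that still uses older class names such as `PdfFileReader`.

```python
import re

# Whole-word match, so identifiers that merely contain "PyPDF2" are left alone.
_PYPDF2_NAME = re.compile(r"\bPyPDF2\b")


def rename_pypdf2(source: str) -> str:
    """Return `source` with every whole-word `PyPDF2` replaced by `pypdf`."""
    return _PYPDF2_NAME.sub("pypdf", source)


old_code = "from PyPDF2 import PdfReader\nreader = PdfReader('example.pdf')\n"
new_code = rename_pypdf2(old_code)
print(new_code)
```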

1451 questions
3
votes
1 answer

Problem running a python script (pypdf/hex errors)

I am trying to create a Python script using the PyPDF module. The script takes the 'Root' folder, merges all the PDFs in it, outputs the merged PDF to an 'Output' folder, and renames it to 'Root.pdf' (the folder which contains the…
Brian
3
votes
0 answers

PDFs written by PyPDF2 showing changes when opened in Acrobat

I'm using Python and PyPDF2 to generate a set of PDFs based on a template with form fields. The PDFs are created and all of the fields are filled correctly, but when I open the PDFs in Adobe Acrobat they show changes made to the file (i.e., the…
3
votes
1 answer

How to convert from PDF to TXT without unintended line breaks?

I am trying to convert a very clean PDF file into a txt file using Python. I have tried PyPDF2 and PDFMiner; both worked perfectly for text recognition. However, because the lines are wrapped in the PDF, the extracted .txt file has unintended line break…
C.Ann.Sng
3
votes
2 answers

Read or save a PDF file uploaded to Flask

I'm uploading multiple files to Flask using a form. I'm getting the file objects in the Flask backend without a problem, but I want to read the PDF files to extract text from them. I can't do it on the file objects I received from the…
Shashank Prasad
3
votes
2 answers

merging pdf files with pypdf

I am writing a script that parses an internet site (maya.tase.co.il) for links, downloads PDF files, and merges them. It mostly works, but merging gives me different kinds of errors depending on the file. I can't seem to figure out why. I cut out the…
user850498
3
votes
0 answers

splitting pages with pyPdf, getting wrong page size

Possible Duplicate: Why my code not correctly split every page in a scanned pdf? I am using pyPdf to split pdf pages. Everything works ok, but page sizes are not the same. Original page size: 1000px p1.mediaBox.upperRight = (w/2, h) # get…
lolalola
3
votes
1 answer

Merge PDF Files using python PyPDF2

I watched a video to learn how to merge PDF files into one PDF file. I tried to modify the code a little so that it works on a folder containing the PDF files. The main folder (Spyder) has Demo.py, and this is the code: import os from PyPDF2…
YasserKhalil
3
votes
3 answers

Textract: failed with exit code 127 // windows 10 // pdftotext

When I try to run my program (after deploying it with PyInstaller) for reading and converting a PDF file and entering it into a Google Sheet, I get the error shown in the image below. However, I cannot seem to figure out what the problem…
Thomas Broek
3
votes
2 answers

PdfReadError: File has not been decrypted

Currently I am using PyPDF2, and I also tried PyPDF4 as a dependency. I have encountered some encrypted files and handled them as you normally would (in the following code): import PyPDF2 import PyPDF4 pdfFileObj = open(r'path', 'rb') #…
Abby
3
votes
0 answers

tabula_py issue How to extract pdf table data spread in multiple pages

I am trying to extract all table data from a PDF using tabula_py as: df=tabula.read_pdf(test_pdf,stream=True,multiple_tables=True,pages="all") The PDF has 3 tables. The second table is spread over 2 pages. When I try len(df), it returns 4…
Sharon
3
votes
1 answer

How to correct this error PyPDF2.utils.PdfReadError: Cannot read an empty file

I have this code that returns an error: packet = io.BytesIO() c = canvas.Canvas(packet) packet.seek(0) new_pdf = PdfFileReader(packet) template = PdfFileReader(open('path_to_template'), "rb") output = PdfFileWriter() page =…
Ptar
3
votes
0 answers

Decompress PDF FlateDecode Filter Annotations with PyPDF2

I am trying the following code to extract the destination or URI of hyperlinks in a PDF file via PyPDF2, but I encountered some encoded destinations. I tried to decompress them with filter.decodeStreamData(); however, my result is still unreadable :( I…
Fishu
3
votes
1 answer

PyPDF2 can't read non-English characters, returns empty string on extractText()

I'm working on a script that will extract data from a large PDF file (40-60 plus pages long) that isn't in English; the file contains Greek characters. All seems good until I run the extractText() function of PyPDF2 to get the given page…
gemgr
3
votes
1 answer

Extracting the keywords from PDF metadata in Python

I have a PDF file, and I want to obtain some information from its metadata. To do so, I follow this procedure: from PyPDF2 import PdfFileReader mypath = "your_pdf_file.pdf" pdf_toread = PdfFileReader(open(mypath, 'rb')) pdf_info =…
msh855
3
votes
1 answer

PyPDF2: Error -5 while decompressing data: incomplete or truncated stream

I'm having a problem with an incomplete or truncated stream while trying to pull data out of a PDF interactive form. Could anyone help me with this, please? PDFfile = open(fname, "rb") pdfread = p2.PdfFileReader(PDFfile) I'm getting the error below when I…
user12515392