Questions tagged [pypdf]

pypdf is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files.

pypdf is a pure-Python library built as a PDF toolkit. It is capable of:

  • extracting document information (title, author, ...),
  • splitting documents page by page,
  • merging documents page by page,
  • cropping pages,
  • merging multiple pages into a single page,
  • encrypting and decrypting PDF files.

Because it is pure Python, it should run on any Python platform without depending on external libraries. It can also work entirely on in-memory binary streams (io.BytesIO in Python 3; StringIO in the original Python 2 releases) rather than files on disk, allowing PDFs to be manipulated in memory. It is therefore a useful tool for websites that manage or manipulate PDFs.

pypdf was unmaintained from 2010 to 2022. Active maintenance resumed in December 2022.

Relationship to PyPDF2

PyPDF2 was a fork of pyPdf.

PyPDF2 received many updates in 2022 but was then deprecated in favor of pypdf.

pypdf==3.1.0 is essentially the same as PyPDF2==3.0.0; only the package name was changed to pypdf.

See: https://pypdf.readthedocs.io/en/latest/meta/history.html

1451 questions
-1
votes
3 answers

Data being written in a single line in some csv file

I have written some code to read data from a specific page of a "pdf" file and write it to a csv file using python. It does its job only partially. However, when it comes to writing data to a csv file, it writes those in a single line instead of the…
SIM
  • 21,997
  • 5
  • 37
  • 109
-1
votes
1 answer

IOError: [Errno 22] Invalid argument

I am trying to concatenate all the pdf into one pdf thereby using PyPDF2 library. I am using python 2.7 for the same. My error is : >>> RESTART: C:\Users\Yash gupta\Desktop\first projectt\concatenate\test\New folder\test.py ['Invoice.pdf',…
Yash Gupta
  • 23
  • 2
  • 10
-1
votes
1 answer

Error writing 3000+ pdf files in one txt file with python 3

I am trying to extract text from 3000+ PDFs in one txt file (while I had to remove headers from each page): for x in range(len(files)-len(files)+15): pdfFileObj=open(files[x],'rb') pdfReader=PyPDF2.PdfFileReader(pdfFileObj) for pageNum…
Yuna Luzi
  • 318
  • 2
  • 9
-1
votes
2 answers

Why does pyPdf2.PdfFileReader() require a file object as an input?

csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
Zev Averbach
  • 1,044
  • 1
  • 11
  • 25
-1
votes
1 answer

PDFQuery + files on server

I'm trying to search for text string, say "can be", in document which is located on 'https://developer.apple.com/library/ios/documentation/ides/conceptual/AppDistributionGuide/AppDistributionGuide.pdf' For this purpose I'm using PDFQuery. Initially…
-2
votes
3 answers

Python Script for counting the number of Pages for each PDF in a directory

I am new to Python, and I am trying to create a script that will list all the PDF’s in a directory and the number of pages in each of the files. I have used the recommended code from this thread: Using Python to pull the number of pages in all the…
-2
votes
1 answer

how to extract a table column data present in pdf and stored inside a variable python

I have 3 tables (image pasted). All 3 tables have the same columns and look the same, and I want the data in the address column (yellow colour) of the 3 tables stored inside a variable.
Deepak Jain
  • 137
  • 1
  • 3
  • 27
-2
votes
1 answer

str' object has no attribute 'getNumPages

I am writing a little program that allows the user to open a pdf file, then the program adds image 1 to pages that contain text 1, image 2 to pages that contain text 2, and save the PDF file. But I kept getting this error "str' object has no…
Zac
  • 13
  • 1
  • 5
-2
votes
1 answer

Grabbing an article from a pdf file - Python

I have more than 5000 pdf files with at least 15 pages each and 20 pages at most. I used pypdf2 to find out which among the 5000 pdf files have the keyword I am looking for and on which page. Now I have the following data: I was wondering if there…
Mtrinidad
  • 157
  • 1
  • 11
-2
votes
1 answer

Removing Gridlines from Scanned Graph Paper Documents

I would like to remove gridlines from a scanned document using Python to make them easier to read. Here is a snippet of what we're working with: As you can see, there are inconsistencies in the grid, and to make matters worse the scanning isn't…
-2
votes
1 answer

PyPDF to read each PDF in a folder

I have the code below which is working. But it only reads one file at a time, which I have to insert in the code. How could I make this code read every PDF in a directory? PDF by PDF. import PyPDF2 import textract import re filename = 'file.pdf'…
Remo
  • 51
  • 5
-2
votes
1 answer

What are some alternatives to PyPDF2 for managing PDF files?

Attempting to read the daily works of a Parliament, I discovered the documents are splintered into many PDF documents which cannot be simply opened by the browser to read and must be downloaded individually. My basic idea is to download all the docs…
Akenaten
  • 91
  • 9
-2
votes
1 answer

How to convert the binary text generated in my .PDF to a string?

I am using this code: from PyPDF2 import PdfFileReader def text_extractor(path): with open(path, 'rb') as f: pdf = PdfFileReader(f) # get the first page page = pdf.getPage(0) print(page) print('Page…
Toni
  • 97
  • 1
  • 10
-2
votes
2 answers

Installing pyPdf library module in kali linux to work on Pdf files

How do I install the pyPdf module on Kali Linux? I tried $sudo apt-get install python-pyPdf but got the error E:unable to locate the package python-pyPdf
Chiru R
  • 1
  • 1
-3
votes
1 answer

How to read pdf file directly from github (without downloading or fetching it from github)?

I understand that we could extract text from pdf file. for example, import pandas as pd import PyPDF2 # ============================================================================= # Extracting from pdf files #…
JamesAng
  • 344
  • 2
  • 9