Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
Questions tagged [pdfplumber]
95 questions
0
votes
1 answer
How to extract radiobutton / checkbox information with python from a pdf-file?
I would like to get the radio-button / checkbox information from a PDF document.
I had a look at pdfplumber and PyPDF2 but was not able to find a solution with these modules.
I can parse the text using this code, but for the radio buttons I get…

Rapid1898
- 895
- 1
- 10
- 32
0
votes
1 answer
How do you get the filename from a `pdfplumber.pdf.PDF`?
I have a function that is passed a pdfplumber.pdf.PDF argument and I need to reference the filename of the PDF. Is there any way to get the filename from a pdfplumber.pdf.PDF class instance?

Keegan Skeate
- 21
- 2
- 6
0
votes
1 answer
pdfplumber to_image() OSError: exception: access violation writing 0x0000000000000008 in Windows 10
I was trying to use the pdfplumber library in Python (ver. 3.10.6) to convert some PDF pages to images, but pdfplumber's to_image() method throws the following error:
>>> import pdfplumber
>>> myDOc = pdfplumber.open("CV.pdf")
>>> myImg =…

Kuba Jjj
- 31
- 4
0
votes
0 answers
Extract only the body text of the PDF, not the bulleted points, headings and subheadings using python pdfplumber library
Code
import pdfplumber
ecdata = ""
with pdfplumber.open("XYZ Transcript.pdf") as pdf:
    for i in range(len(pdf.pages)):
        print("Page No.: ", i+1)
        page_obj = pdf.pages[i]
        page = page_obj.within_bbox((70, 50, page_obj.width,…
0
votes
1 answer
When running pdfplumber in python I got an error --> CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team
I'm using a Python script that extracts the text content of a PDF file using pdfplumber.
When running pdfplumber in python I got an error like this
CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team.
Therefore,…

Lintang Gilang Pratama
- 89
- 1
- 5
0
votes
2 answers
extract the specific text from pdfs using python
I have tried different Python libraries to extract specific text from PDFs. I have to extract text under the heading pdf1
From this PDF, I have to extract the text starting from Case 1 to diamond ◆ bold.
The next pdf contains the data in a…

Arvind Singh
- 1
- 1
0
votes
1 answer
how to do complex pdf extraction with regex
I have a PDF file which contains lottery ticket winners. I want to extract all winning tickets according to their prizes.
PDF file
I tried this:
import re
import pdfplumber
prize_re = re.compile(r"^\d[a-z]")
cons_prize_re =…

Chams Agouni
- 364
- 1
- 12
0
votes
1 answer
Pdfplumber - Extract a table in pdf without any borders
I am trying to extract the table shown in the image here into a data frame. I tried using tabula-py to extract it, but read_pdf returned []. I am not sure if tabula-py is the right module to use. Can anyone help?

PythonEnthusiast
- 37
- 6
0
votes
1 answer
Mapping highlighted text in a pdf document to a character index range in its .txt output
I have a project where I have to highlight text in a structured PDF document and classify it so I can perform regex on multiple substrings and give their respective variables the proper values. Is there a way to have a PDF prompted to the screen…

PeterQuando
- 75
- 1
- 7
0
votes
1 answer
How to take multiple pages as input in pdfplumber?
I am using pdfplumber to take input from a PDF file.
My question is how I can take pages 1-7 as input using pdfplumber.
I'm using this code:
filename = "1st Year 1stSemester.pdf"
pdf = pdfplumber.open(filename)
totalpages = len(pdf.pages)
p0 =…

NobinPegasus
- 545
- 2
- 16
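
For the page-range question above, one possible approach is plain list slicing, since `pdf.pages` behaves like a Python list of page objects. The sketch below is a minimal illustration, not a confirmed answer to the truncated code; the helper name `select_pages` is hypothetical and not part of pdfplumber, and the demo uses stand-in strings instead of real pages.

```python
def select_pages(pages, first, last):
    """Return pages first..last (1-based, inclusive) from a list of pages."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Python slices are 0-based and exclude the end index.
    return pages[first - 1:last]


# Stand-in data (assumption): with pdfplumber this would be `pdf.pages`.
demo_pages = [f"page {n}" for n in range(1, 11)]
selected = select_pages(demo_pages, 1, 7)
print(selected[0], "...", selected[-1])
```

With a real document, the same call would presumably look like `for page in select_pages(pdf.pages, 1, 7): text = page.extract_text()`.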
0
votes
2 answers
pdfplumber memory hogging (crash with large pdf files)
Using pdfplumber to extract text from large pdf files crashes it.
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        **do something**

Filipe Lemos
- 500
- 3
- 13
0
votes
1 answer
Pdfplumber misses first column and last row for all tables within a schematic
I am new to pdfplumber, and I am amazed by how it extracts text from tables.
It's easy to work with for full-page tables, but in my case I am using some topological schematics with some tables inside.
It fails to extract the first column and…

Pablo
- 557
- 3
- 16
0
votes
1 answer
Can pdfplumber extract tables for my scanned pdfs?
(I know that pdfplumber is mainly geared towards computer-generated PDFs.
However, before I spend a couple of days handtyping data from my scanned PDFs, I thought I'd ask if pdfplumber could somehow help me.)
My problem:
I have scanned PDFs from…

Tototulbi
- 15
- 4
0
votes
1 answer
How to count the number of words from a list from a text extract in a pdf using Python?
I am trying to count a series of words extracted from a PDF, but I only get 0, which is not correct.
total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables=[]
words = ['blank','warrant ','offering','combination ','SPAC','founders']
count={} #…

Math4264
- 3
- 2
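
The question's code is truncated, so the cause of the zero counts is not known. As a hedged sketch of one way to count keywords, the example below assumes the extracted text is available as a single string (for instance, the result of `page.extract_text()`), strips stray whitespace from each keyword, and matches whole words case-insensitively. The function name `count_keywords` and the sample text are illustrative assumptions.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts = Counter()
    for keyword in keywords:
        cleaned = keyword.strip()  # drop trailing spaces such as 'warrant '
        pattern = r"\b" + re.escape(cleaned) + r"\b"
        counts[cleaned] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return dict(counts)


# Made-up sample text standing in for text extracted from a PDF.
sample_text = "The SPAC offering includes one warrant. Founders hold warrants."
keywords = ["blank", "warrant ", "offering", "combination ", "SPAC", "founders"]
result = count_keywords(sample_text, keywords)
print(result)
```

Whole-word matching means "warrants" is not counted as "warrant"; dropping the `\b` anchors would count substrings instead.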
0
votes
1 answer
pdfplumber extract_text function also extracts text from the table. Only want to extract text outside of the table
I have a pdf that contains text and tables. I want to extract both of them but when I used the extract_text function it also extracts the content which is inside of the table. I just want to only extract the text which is outside the table and the…

Deepam
- 1
- 2
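
One way to approach the text-outside-tables question is to drop every character whose center falls inside a table's bounding box. The sketch below is a pure-Python illustration under stated assumptions: characters are dictionaries with `x0`, `x1`, `top`, and `bottom` keys (the shape pdfplumber uses for `page.chars`), boxes are `(x0, top, x1, bottom)` tuples, and the helper names `is_inside` and `chars_outside_boxes` are hypothetical.

```python
def is_inside(char, box):
    """True if the character's center lies within box (x0, top, x1, bottom)."""
    x0, top, x1, bottom = box
    center_x = (char["x0"] + char["x1"]) / 2
    center_y = (char["top"] + char["bottom"]) / 2
    return x0 <= center_x <= x1 and top <= center_y <= bottom


def chars_outside_boxes(chars, boxes):
    """Keep only the characters that fall outside every box."""
    return [c for c in chars if not any(is_inside(c, b) for b in boxes)]


# Made-up data: one character in body text, one inside a table region.
demo_chars = [
    {"text": "A", "x0": 10, "x1": 15, "top": 10, "bottom": 20},
    {"text": "B", "x0": 110, "x1": 115, "top": 210, "bottom": 220},
]
demo_table_boxes = [(100, 200, 300, 400)]
outside = chars_outside_boxes(demo_chars, demo_table_boxes)
print("".join(c["text"] for c in outside))
```

With pdfplumber, the boxes could plausibly come from `[t.bbox for t in page.find_tables()]`, and a test built on `is_inside` could be passed to `page.filter(...)` before calling `extract_text()`; this has not been verified against any particular PDF.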