Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
Questions tagged [pdfplumber]
95 questions
0
votes
0 answers
Pdfplumber cannot recognize table
image.reset().debug_tablefinder()
result
How can I convert it into tables that pdfplumber can recognize?

Anita
0
votes
1 answer
How to convert every page of pdf to a pdf object using python
I want to turn each page of a PDF file into a new PDF object. I am following the code snippet at https://stackoverflow.com/a/490203/13291630, but it creates a new file for each page, whereas I want to just create a pdf object without…

Mr Anonymous
0
votes
1 answer
get the table by passing table header in pdf using python
I have a PDF with multiple tables in it. I need to pass a table header and get the corresponding table.
For example:
If I pass the table name "daily historical stock prices & volumes", it must return the table above.

End user
0
votes
2 answers
How to extract table details into rows and columns using pdfplumber
I am using pdfplumber to extract tables from a PDF. But the table in question has no visible vertical lines separating its content, so the extracted data comes out as 3 rows and one huge column.
I would like the above table to come into 13 rows.
import…
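A minimal sketch of one approach to tables without ruling lines, assuming pdfplumber's documented "text" strategies are suitable for this PDF (they may still need tuning). The helper name extract_rows is hypothetical, and `page` is assumed to be a pdfplumber Page.

```python
# Table settings that infer cell boundaries from text alignment rather than
# from drawn lines. "vertical_strategy" and "horizontal_strategy" with the
# value "text" are pdfplumber table-extraction settings.
TEXT_BASED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_rows(page, settings=TEXT_BASED_SETTINGS):
    """Return the table found on a pdfplumber page as a list of rows.

    extract_rows is a hypothetical helper used only for this sketch.
    """
    table = page.extract_table(table_settings=settings)
    # extract_table returns None when no table is found.
    return table or []
```

With a real file this would be called as extract_rows(pdf.pages[0]) inside a `with pdfplumber.open(...) as pdf:` block. Whether the result has the expected 13 rows depends on how consistently the text is aligned.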

walter_anderson
0
votes
0 answers
Why does pdfplumber yield no data?
I usually use pdfplumber to scrape data and text from pdfs, and 99.99% of the time, everything is fine.
Today, though, I have encountered a case where I can open the PDF file (using pdfplumber.open) but cannot extract any text, words, or tables. I know…

Odhian
0
votes
1 answer
How to print the next line in Python with text extracted using pdfplumber
How can I print the next line from the text that I extracted from a PDF using
pdfplumber's extract_text function?
I have tried line.next(), but it does not work.
The actual job name is on the line after "Job Name", as in the example below.
Job…
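A minimal sketch of one way to do this, assuming the extracted text is a single string with newline separators. Strings have no next() method, so the text is split into a list and the following entry is looked up by index. The helper name line_after and the sample text are made up for illustration.

```python
def line_after(text, marker):
    """Return the first non-empty line following a line that contains marker.

    Returns None if the marker is missing or nothing follows it.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            for following in lines[index + 1:]:
                if following.strip():
                    return following.strip()
            return None
    return None


# Hypothetical sample standing in for the output of page.extract_text().
sample = "Job Name\nWarehouse Roof Repair\nJob Number\n1042"
print(line_after(sample, "Job Name"))  # Warehouse Roof Repair
```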

Autom8
0
votes
0 answers
Encoding issues when extracting text from a PDF file using pdfplumber
I would like to extract the content of the following PDF file, but the extraction returns a meaningless result. I assume it might be related to the file's encoding, but the same extraction code works for many other files on the same infrastructure.…

fillo
0
votes
1 answer
List Index Out of Range when using PDF Plumber
I am extracting text from PDFs using pdfplumber and writing it to a text file, but I am getting an "index out of range" error.
import glob
import pdfplumber
for filename in glob.glob('*.pdf'):
    pdf = pdfplumber.open(filename)
    OutputFile =…

Haris Trading
0
votes
1 answer
Extract text from pdf file using pdfplumber
I want to extract text from a pdf file, tried:
directory = r'C:\Users\foo\folder'
for x in os.listdir(directory):
    print(x)
    x = x.replace('.pdf','')
    filename = os.fsdecode(x)
    print(x)
    if filename.endswith('.pdf'):
        with…
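In the snippet as shown, '.pdf' appears to be removed from the name before the endswith('.pdf') check, so that check cannot succeed. A minimal sketch of an alternative way to collect the files, using only pathlib; the helper name pdf_paths is hypothetical.

```python
from pathlib import Path


def pdf_paths(directory):
    """Return the PDF files directly inside directory, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Compare the suffix case-insensitively so "report.PDF" is included.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path keeps its extension, so it can be handed to pdfplumber.open(path) directly, and path.stem gives the name without ".pdf" if that is needed for output files.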

nilsinelabore
0
votes
2 answers
How to go about isolating dollar amounts using Regex?
I used the PDFPlumber library to extract all the lines in my PDF, a sample line extract looks like this:
Total Return Transportation $16.01
The goal is to put all of these into a data frame. How do I use regex to group this line so that I may…
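A minimal sketch of one possible pattern, assuming every line of interest ends with a dollar amount such as $16.01 or $1,204.50. The names LINE_PATTERN and split_line are made up for this example, and other line layouts would need a different pattern.

```python
import re

# label: everything before the dollar sign (matched lazily).
# amount: digits with optional thousands separators and optional cents.
LINE_PATTERN = re.compile(
    r"^(?P<label>.+?)\s+\$(?P<amount>\d[\d,]*(?:\.\d{2})?)\s*$"
)


def split_line(line):
    """Return (label, amount) for a matching line, or None otherwise."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    return match.group("label"), amount


print(split_line("Total Return Transportation $16.01"))
# ('Total Return Transportation', 16.01)
```

A list of these tuples, with the None results filtered out, could then be passed to a data frame constructor with two column names.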

pvell
0
votes
1 answer
How to optimize (also RAM wise) code that is saving words from PDF to Python object and later into database?
I am looking for the most efficient way of saving text from PDF files into my database. Currently I am using pdfplumber with standard code looking like this:
my_string = ''
with pdfplumber.open(text_file_path) as pdf:
    for page in pdf.pages:
        …
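A minimal sketch of a streaming alternative, assuming the elided loop body appends each page's text to my_string. Yielding one page at a time lets each page be written to the database before the next is read, and a single join avoids repeated string concatenation when the whole text is needed. The helper names are hypothetical; `pages` is assumed to be pdf.pages or any iterable of objects with an extract_text() method.

```python
def iter_page_text(pages):
    """Yield the text of each page, one page at a time."""
    for page in pages:
        # extract_text() may give None or an empty string for a page
        # without a text layer, so normalise that to "".
        yield page.extract_text() or ""


def join_pages(pages, separator="\n"):
    """Build the full text with one join instead of repeated +=."""
    return separator.join(iter_page_text(pages))
```

Whether this lowers peak memory in practice depends on how much pdfplumber itself keeps cached per page, which this sketch does not address.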

Peksio
0
votes
1 answer
Converting pytesseract.Output.DATAFRAME into bytes or ocr'ed pdf
Is it possible to write to a pdf file retroactively using pytesseract.image_to_data() output?
For my OCR pipeline, I needed granular access to my pdf's ocr'ed data. I requested that using this method:
ocr_dataframe = pytesseract.image_to_data(
…

abrezey
0
votes
1 answer
How to ignore table and its content while extracting text from pdf
So far I have successfully extracted the text content from a PDF file. I am stuck at the point where I have to extract only the text outside of the table (ignoring the table and its content), and I need help.
The Pdf can be downloaded from here
import…
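A minimal sketch of one commonly suggested approach, assuming pdfplumber detects the table on the page: collect the bounding boxes from page.find_tables(), then use page.filter() to drop every object whose midpoint falls inside one of them. The helper names are hypothetical, and `page` is assumed to be a pdfplumber Page.

```python
def midpoint_in_bbox(obj, bbox):
    """True if the centre of a pdfplumber object dict lies inside bbox.

    bbox uses pdfplumber's (x0, top, x1, bottom) order.
    """
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def text_outside_tables(page):
    """Extract text from a page, skipping objects inside detected tables."""
    bboxes = [table.bbox for table in page.find_tables()]
    kept = page.filter(
        lambda obj: not any(midpoint_in_bbox(obj, bbox) for bbox in bboxes)
    )
    return kept.extract_text()
```

If find_tables() misses the table, nothing is filtered out, so the table detection settings may need adjusting first.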

go sgenq
0
votes
1 answer
PDFPlumber returning symbols and inaccurate text
I'm trying to extract text from a pdf file using PDFplumber
import pdfplumber
pdf = pdfplumber.open(r"https://www.lupin.com/pdf/financials/subsidiaries/multicare-pharmaceuticals-philippines-inc-philippines-2018.pdf")
for ps in pdf.pages: …
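Separately from the symbol problem, pdfplumber.open expects a local path or a file-like object, so as far as I know a URL string is not downloaded for you. A minimal sketch of fetching the bytes first with the standard library; the helper name fetch_pdf is hypothetical, and some servers may reject the default urllib request.

```python
import io
import urllib.request


def fetch_pdf(url, timeout=30):
    """Download a PDF and return it as an in-memory binary stream."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())
```

The returned stream can be passed to pdfplumber.open() in place of a filename. This only addresses how the file is loaded; garbled symbols can also come from how the PDF encodes its fonts, which this sketch does not fix.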

Nikhil T
0
votes
0 answers
I am having issues extracting Hindi text from a PDF in Python
I am using pdfplumber in Python.
It does not extract Hindi text correctly and produces wrong results.
Input: माँ, मैं रात का खाना ले आऊँगा।
Output: म ,ाँ म ैं र त क ख न ले आऊाँग ।
I want the output to match the input exactly.
Is there a solution?