Questions tagged [pdfplumber]

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

95 questions
0
votes
0 answers

Pdfplumber cannot recognize table

How can I convert the result of image.reset().debug_tablefinder() into tables that pdfplumber can recognize?
Anita
  • 1
0
votes
1 answer

How to convert every page of pdf to a pdf object using python

I want to turn each page of a PDF file into a new PDF object. I am following the code snippet at https://stackoverflow.com/a/490203/13291630, but it creates a new file, whereas I just want to create a PDF object without…
Mr Anonymous
  • 75
  • 10
0
votes
1 answer

get the table by passing table header in pdf using python

I have a PDF with multiple tables in it. I need to pass a table header and get the respective table. For example, if I pass the table name "daily historical stock prices & volumes", it must return that table.
End user
  • 77
  • 3
0
votes
2 answers

How to extract table details into rows and columns using pdfplumber

I am using pdfplumber to extract tables from a PDF. The table in question has no visible vertical lines separating its content, so the extracted data comes out as 3 rows and one huge column. I would like the table to come out as 13 rows. import…
0
votes
0 answers

Why does pdfplumber yield no data?

I usually use pdfplumber to scrape data and text from PDFs, and 99.99% of the time everything is fine. Today, though, I encountered a case where I can open the PDF file (using pdfplumber.open) but cannot extract any text, words, or tables. I know…
Odhian
  • 351
  • 5
  • 14
0
votes
1 answer

How to print the next line in Python with text extracted using pdfplumber

How can I print the next line of the text that I extracted from a PDF using pdfplumber's extract_text function? I have tried line.next(), but it does not work. The actual job name is on the line after "Job Name", as in the example below. Job…
Autom8
  • 385
  • 2
  • 3
  • 10
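
For the "Job Name" question above, one possible approach is to split the extracted text into lines and return the line that follows the label. This is only a sketch under assumptions: value_after_label is a hypothetical helper name, and sample_text is made-up data standing in for whatever the page's text extraction returns, since the asker's PDF is not shown.

```python
def value_after_label(text, label="Job Name"):
    """Return the line following the first line that contains `label`, or None."""
    lines = text.splitlines()
    # Skip the last line: it has no following line to return.
    for i, line in enumerate(lines[:-1]):
        if label in line:
            return lines[i + 1].strip()
    return None


# Hypothetical sample standing in for text extracted from a PDF page.
sample_text = "Job Name\nWeekly Backup\nStatus\nComplete"
print(value_after_label(sample_text))
```

Strings have no next() method, which is why line.next() fails; indexing into the list of lines avoids the problem.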
0
votes
0 answers

Encoding issues when extracting text from a PDF file using pdfplumber

I would like to extract the content of the following PDF file, but the extraction returns a meaningless result. I assume it might be related to the file's encoding, but the same extraction code works for many other files on the same infrastructure.…
fillo
  • 365
  • 1
  • 12
0
votes
1 answer

List Index Out of Range when using PDF Plumber

Hello, I am extracting text from PDFs using pdfplumber and writing it to a text file, but I am getting an index out of range error. import glob import pdfplumber for filename in glob.glob('*.pdf'): pdf = pdfplumber.open(filename) OutputFile =…
0
votes
1 answer

Extract text from pdf file using pdfplumber

I want to extract text from a PDF file. I tried: directory = r'C:\Users\foo\folder' for x in os.listdir(directory): print(x) x = x.replace('.pdf','') filename = os.fsdecode(x) print(x) if filename.endswith('.pdf'): with…
nilsinelabore
  • 4,143
  • 17
  • 65
  • 122
0
votes
2 answers

How to go about isolating dollar amounts using Regex?

I used the PDFPlumber library to extract all the lines in my PDF, a sample line extract looks like this: Total Return Transportation $16.01 The goal is to put all of these into a data frame. How do I use regex to group this line so that I may…
pvell
  • 1
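
For the dollar-amount question above, a regular expression with two named groups can separate the description from the amount. This is a minimal sketch, assuming every line ends with a dollar amount written like $16.01 (thousands separators optional); LINE_RE and split_label_amount are hypothetical names, and only the sample line comes from the question.

```python
import re

# Lazy label, then whitespace, then "$" followed by digits (commas allowed)
# and an optional two-digit cents part at the end of the line.
LINE_RE = re.compile(r"^(?P<label>.+?)\s+\$(?P<amount>\d[\d,]*(?:\.\d{2})?)\s*$")


def split_label_amount(line):
    """Return (label, amount) for a line ending in a dollar amount, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    return match.group("label"), amount


print(split_label_amount("Total Return Transportation $16.01"))
```

A list of such tuples, with non-matching lines filtered out, could then be passed to pandas.DataFrame with two column names to build the data frame the question asks about.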
0
votes
1 answer

How to optimize (also RAM wise) code that is saving words from PDF to Python object and later into database?

I am looking for the most efficient way of saving text from PDF files into my database. Currently I am using pdfplumber with standard code looking like this: my_string = '' with pdfplumber.open(text_file_path) as pdf: for page in pdf.pages: …
Peksio
  • 525
  • 6
  • 25
0
votes
1 answer

Converting pytesseract.Output.DATAFRAME into bytes or ocr'ed pdf

Is it possible to write to a pdf file retroactively using pytesseract.image_to_data() output? For my OCR pipeline, I needed granular access to my pdf's ocr'ed data. I requested that using this method: ocr_dataframe = pytesseract.image_to_data( …
abrezey
  • 135
  • 9
0
votes
1 answer

How to ignore table and its content while extracting text from pdf

So far I have successfully extracted the text content from a PDF file. I am stuck at the point where I have to extract only the text outside the table (ignoring the table and its content) and need help. The PDF can be downloaded from here. import…
go sgenq
  • 313
  • 3
  • 13
0
votes
1 answer

PDFPlumber returning symbols and inaccurate text

I'm trying to extract text from a pdf file using PDFplumber import pdfplumber pdf = pdfplumber.open(r"https://www.lupin.com/pdf/financials/subsidiaries/multicare-pharmaceuticals-philippines-inc-philippines-2018.pdf") for ps in pdf.pages: …
Nikhil T
  • 1
  • 1
0
votes
0 answers

I am having issues extracting Hindi text from a PDF in Python

I am using pdfplumber in Python, but it does not extract Hindi text correctly. Input: माँ, मैं रात का खाना ले आऊँगा। Output: म ,ाँ म ैं र त क ख न ले आऊाँग । I want the output to match the input exactly. Is there a solution?