Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
Questions tagged [pdfplumber]
95 questions
1
vote
1 answer
Is there a way to extract sentences after bold text in Python?
I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, after the bold text, e.g. "Blue sky is what we see when we look up."
I can extract the blue sky part. But I'm…

Ben
- 11
- 1
1
vote
1 answer
How to filter text within a certain area using pdfplumber and OpenCV?
I've got a bunch of pdf files which are from conference proceedings.
Every pdf file's structure looks like:
Title in a bold, large font
Author1 Author2 AuthorN
Affiliation1 …

Valuex
- 104
- 1
- 10
1
vote
2 answers
extracting images from PDF with page and screen coordinate information
I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd…

peter.murray.rust
- 37,407
- 44
- 153
- 217
1
vote
0 answers
How to extract tables from PDFs while pulling in non-table text section identifiers
I'm working through extracting tables using pdfplumber in Python from a PDF that has mostly-consistent structure between pages.
My goal is to extract each of the 2 tables under each section header (white font highlighted blue) on each page. See…

WinstonDoodle
- 23
- 3
1
vote
0 answers
Replace (cid:) with chars using Python REGEX findall where data extracted from PDF using PDFPlumber / pdfMiner
Following on from Replace (cid:) with chars using Python when extracting text from PDF files (I can't add a comment there), I attempted to convert the following with @josefz's script but get unrecognisable strings that are not in the original PDF.…

DaveC
- 21
- 3
1
vote
1 answer
ocrmypdf - could not find source-pdf?
I would like to use ocrmypdf to convert a PDF file from a picture into a readable PDF.
I tried it with the following simple code:
(the invoice.pdf is of course available in the same path as the python-script and the output.pdf should be…

Rapid1898
- 895
- 1
- 10
- 32
1
vote
0 answers
How to extract data from messy PDF file with no standard formatting?
I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract tables from it, but the data in the PDF is not stored in tables. So I chose pdfplumber to extract the text. Until now, I am able…

Aamir Khan Maarofi
- 157
- 2
- 13
1
vote
1 answer
pdfplumber | Extract text from dynamic column layouts
Attempted Solution at bottom of post.
I have near-working code that extracts the sentence containing a phrase, across multiple lines.
However, some pages have columns, so the respective outputs are incorrect: separate texts are wrongly merged…

StressedBoi69420
- 1,376
- 1
- 12
- 40
1
vote
1 answer
Scraping a sentence across many lines | Recursive error unresolved
Goal: if pdf line contains sub-string, then copy entire sentence (across multiple lines).
I am able to print() the line the phrase appears in.
Now, once I find this line, I want to go back iterations, until I find a sentence terminator: . ! ?, from…

StressedBoi69420
- 1,376
- 1
- 12
- 40
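The goal described in the question above (find the sentence that contains a phrase, even when the sentence is split across extracted lines) can be sketched without recursion. This is a minimal illustration, not the asker's or the answerer's code: the function name `sentences_containing` and the sample lines are hypothetical, and it assumes sentences end only at `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." would be split incorrectly.

```python
import re

# Split after a sentence terminator (. ! ?) that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_containing(lines, phrase):
    """Return every sentence containing `phrase`, even if it spans several lines."""
    # Join the extracted lines first so a sentence broken across lines is contiguous.
    text = " ".join(line.strip() for line in lines if line.strip())
    sentences = SENTENCE_END.split(text)
    # Case-insensitive match on the phrase.
    return [s for s in sentences if phrase.lower() in s.lower()]


# Hypothetical lines, shaped like the output of splitting extracted page text on newlines.
lines = [
    "The sky was clear. Blue sky is what",
    "we see when we look up. It rained",
    "later that day!",
]
print(sentences_containing(lines, "look up"))
```

Joining the lines before splitting avoids walking backwards and forwards from the matching line, which is where the iteration logic in the question becomes hard to follow.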
1
vote
1 answer
Python - inserting header into a csv
I'm developing a script that extracts text from all pdf files in a directory via a loop and inserts them into individual cells of a csv file. I can successfully write the output into the cells. However, I need the csv file to contain the header…

Daniel Hutchinson
- 155
- 14
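For the header problem in the question above, one common approach with the standard `csv` module is to write the header row once, before the data rows. The sketch below is only an illustration under assumptions: the column names `filename` and `text` and the sample rows are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_rows_with_header(handle, header, rows):
    """Write a single header row, then all data rows, to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)   # header goes in exactly once, before the loop output
    writer.writerows(rows)


# Hypothetical data: one row per PDF, holding the file name and its extracted text.
rows = [
    ["a.pdf", "first document"],
    ["b.pdf", "second document"],
]

buffer = io.StringIO()
write_rows_with_header(buffer, ["filename", "text"], rows)
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="")` lets the `csv` module control line endings itself.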
1
vote
1 answer
Python & Pandas: combining multiple rows into single cell
I'm writing a script that extracts text from a pdf file and inserts it as a string into a single csv row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the csv as an individual row. However, I'm struggling…

Daniel Hutchinson
- 155
- 14
1
vote
0 answers
Separating large PDF document into smaller documents based on content
I have a large pdf file with very specific formatting, a bunch of reports if you will, all in one big pdf document. I'm using pdfplumber to extract specific text within a bounding box on each page. I've called this variable scene_text. The value of…

John
- 11
- 1
1
vote
1 answer
How to complete for loop with pdfplumber?
Problem
I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16
when the code returned this error.
Goal
I need to scrape a pdf that looks like this (I wanted to attach the pdf…

Edo Grm
- 13
- 4
1
vote
2 answers
How to stop pdfplumber from reading the header of every page?
I want pdfplumber to extract the text from a random pdf given by the user. The problem is that pdfplumber also extracts the header text or the title from each page. How can I program pdfplumber to not read the page headers (titles) and the page…

Anandakrishnan
- 349
- 5
- 10
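One way to approach the header question above is to drop words by vertical position. The sketch below is an assumption-laden illustration, not an answer from the thread: the function `body_text`, the 50-point margins and the sample words are hypothetical. The dictionaries mimic the `text`, `top` and `bottom` keys that pdfplumber's `page.extract_words()` returns, where `top` and `bottom` are measured from the top of the page.

```python
def body_text(words, page_height, header_margin=50, footer_margin=50):
    """Join the words that lie outside the assumed header and footer bands."""
    kept = [
        w for w in words
        if w["top"] >= header_margin and w["bottom"] <= page_height - footer_margin
    ]
    return " ".join(w["text"] for w in kept)


# Hypothetical words on a 792-point-high page: a running header, body text, a footer.
words = [
    {"text": "Annual", "top": 20.0, "bottom": 32.0},
    {"text": "Report", "top": 20.0, "bottom": 32.0},
    {"text": "Revenue", "top": 120.0, "bottom": 132.0},
    {"text": "grew.", "top": 120.0, "bottom": 132.0},
    {"text": "Page", "top": 770.0, "bottom": 782.0},
]
print(body_text(words, page_height=792))
```

With a real document, `words` would come from `page.extract_words()` and `page_height` from `page.height`. The margins have to be tuned per document, since header height varies between PDFs.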
1
vote
2 answers
How to remove spaces between English words after extracting with pdfplumber
I have extracted text from a PDF (using pdfplumber) to a txt file, but there are some spaces between words that are not in the PDF file.
I have tried using nltk to find words using a "Previous_word" + "current_word" combination and checking if they exist in…

Joy
- 145
- 2
- 9