Questions tagged [pdfplumber]

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

95 questions
1
vote
1 answer

Is there a way to extract sentences after bold text in Python?

I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, after the bold text, e.g. "Blue sky is what we see when we look up." I can extract the blue sky part. But I'm…
Ben
  • 11
  • 1
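
One way to approach this, as a minimal sketch: walk the character list, collect each bold run, then take the first sentence of the non-bold text that follows. It assumes character dicts shaped like pdfplumber's `page.chars` (keys `text` and `fontname`) and that bold fonts carry "Bold" in their font name, which is a heuristic and not guaranteed for every PDF. The helper names `is_bold` and `text_after_bold` are hypothetical.

```python
import re


def is_bold(char):
    # Heuristic (assumption): bold fonts usually have "Bold" in their name.
    return "bold" in char.get("fontname", "").lower()


def text_after_bold(chars, max_sentences=1):
    """Pair each bold run with the first sentence(s) of the non-bold text after it."""
    results = []
    i, n = 0, len(chars)
    while i < n:
        if not is_bold(chars[i]):
            i += 1
            continue
        # Collect the run of consecutive bold characters.
        start = i
        while i < n and is_bold(chars[i]):
            i += 1
        bold_text = "".join(c["text"] for c in chars[start:i]).strip()
        # Collect the non-bold characters up to the next bold run.
        start = i
        while i < n and not is_bold(chars[i]):
            i += 1
        following = "".join(c["text"] for c in chars[start:i]).strip()
        # Split on whitespace that follows a sentence terminator.
        sentences = re.split(r"(?<=[.!?])\s+", following)
        if bold_text:
            results.append((bold_text, " ".join(sentences[:max_sentences])))
    return results
```

With a real file this would be called as `text_after_bold(page.chars)`. Line breaks are not represented in the character list, so spacing across lines may need to be rebuilt from the coordinates.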
1
vote
1 answer

How to filter text within a certain area using pdfplumber and OpenCV?

I've got a bunch of PDF files from conference proceedings. Every PDF file's structure looks like: Title with bold, large size font Author1 Author2 AuthorN Affiliation1 …
Valuex
  • 104
  • 1
  • 10
1
vote
2 answers

extracting images from PDF with page and screen coordinate information

I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd…
peter.murray.rust
  • 37,407
  • 44
  • 153
  • 217
1
vote
0 answers

How to extract tables from PDFs while pulling in non-table text section identifiers

I'm working through extracting tables using pdfplumber in Python from a PDF that has mostly-consistent structure between pages. My goal is to extract each of the 2 tables under each section header (white font highlighted blue) on each page. See…
1
vote
0 answers

Replace (cid:) with chars using Python REGEX findall where data extracted from PDF using PDFPlumber / pdfMiner

Following on from Replace (cid:) with chars using Python when extracting text from PDF files (I can't add a comment there), I attempted to convert the following with @josefz's script but get unrecognisable strings not in the original PDF.…
DaveC
  • 21
  • 3
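
A hedged sketch of the regex side of this: match every `(cid:N)` token and substitute it through an explicit lookup table. The table `cid_map` is hypothetical and has to be built per font. CID numbers are font-specific glyph IDs, not necessarily Unicode code points, so converting them with `chr()` can produce strings that never appeared in the PDF.

```python
import re

# Matches tokens such as "(cid:72)" and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the CID numbers found in the text, in order of appearance."""
    return [int(number) for number in CID_PATTERN.findall(text)]


def replace_cids(text, cid_map):
    """Replace each (cid:N) token using cid_map; unknown CIDs are left as-is."""
    def substitute(match):
        cid = int(match.group(1))
        return cid_map.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)
```

Leaving unmapped tokens untouched makes it easy to see which glyphs still need an entry in the table.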
1
vote
1 answer

ocrmypdf - could not find source-pdf?

I would like to use ocrmypdf to convert a PDF file from a picture to a readable PDF. I tried it with the following simple code: (the invoice.pdf is of course available in the same path as the Python script and the output.pdf should be…
Rapid1898
  • 895
  • 1
  • 10
  • 32
1
vote
0 answers

How to extract data from messy PDF file with no standard formatting?

I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract tables out of it, but the data in the PDF is not stored in tables. So I chose pdfplumber to extract text out of it. So far, I am able…
1
vote
1 answer

pdfplumber | Extract text from dynamic column layouts

Attempted solution at bottom of post. I have near-working code that extracts the sentence containing a phrase, across multiple lines. However, some pages have columns, so the respective outputs are incorrect, with separate texts wrongly merged…
1
vote
1 answer

Scraping a sentence across many lines | Recursive error unresolved

Goal: if a PDF line contains a sub-string, then copy the entire sentence (across multiple lines). I am able to print() the line the phrase appears in. Now, once I find this line, I want to go back through earlier iterations until I find a sentence terminator: . ! ?, from…
StressedBoi69420
  • 1,376
  • 1
  • 12
  • 40
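
As a sketch of a non-recursive alternative to walking backwards line by line: join the extracted lines into one string first, split it on sentence terminators, and keep the sentences that contain the phrase. The function name `sentences_containing` is hypothetical, and the split is a simple heuristic that does not handle abbreviations or words hyphenated across lines.

```python
import re


def sentences_containing(lines, phrase):
    """Return whole sentences (possibly spanning several lines) that contain phrase."""
    # Join non-empty lines so a sentence broken across lines becomes contiguous.
    text = " ".join(line.strip() for line in lines if line.strip())
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if phrase.lower() in s.lower()]
```

The input could come from `page.extract_text().splitlines()`. Multi-column pages would still need their columns separated before joining.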
1
vote
1 answer

Python - inserting header into a csv

I'm developing a script that extracts text from all pdf files in a directory via a loop and inserts them into individual cells of a csv file. I can successfully write the output into the cells. However, I need the csv file to contain the header…
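
A minimal sketch using the standard `csv` module: write the header row once, before the loop's rows. The column names and the helper `write_rows_with_header` are assumptions for illustration, since the excerpt does not show the actual header.

```python
import csv


def write_rows_with_header(out_file, header, rows):
    """Write one header row, then the data rows, to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(header)   # header goes first, exactly once
    writer.writerows(rows)    # then one row per extracted PDF
```

When writing to disk, opening the file with `open(path, "w", newline="")` avoids blank lines between rows on Windows.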
1
vote
1 answer

Python & Pandas: combining multiple rows into single cell

I'm writing a script that extracts text from a pdf file and inserts it as a string into a single csv row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the csv as an individual row. However, I'm struggling…
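
One possible approach, sketched with plain Python instead of pandas: collect each page's text in a list and join it into a single string before writing, so the whole document lands in one cell. The helper name `combine_pages` is hypothetical. Empty or `None` page results are skipped as a precaution.

```python
def combine_pages(page_texts, separator="\n"):
    """Join per-page text into one string, skipping empty or missing pages."""
    return separator.join(text for text in page_texts if text)
```

For example, `combine_pages(page.extract_text() for page in pdf.pages)` would yield one string per PDF, which can then be written as a single CSV field.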
1
vote
0 answers

Separating large PDF document into smaller documents based on content

I have a large pdf file with very specific formatting, a bunch of reports if you will, all in one big pdf document. I'm using pdfplumber to extract specific text within a bounding box on each page. I've called this variable scene_text. The value of…
John
  • 11
  • 1
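
A sketch of the grouping step only, under a loudly stated assumption: consecutive pages with the same extracted value (such as `scene_text`) belong to the same smaller document. The excerpt is truncated, so this may not match the actual rule. The function name `group_pages_by_key` is hypothetical.

```python
from itertools import groupby


def group_pages_by_key(page_keys):
    """Group consecutive page indexes that share the same key.

    page_keys holds one value per page, in page order.
    Returns a list of (key, [page_index, ...]) tuples.
    """
    groups = []
    index = 0
    for key, run in groupby(page_keys):
        count = len(list(run))
        groups.append((key, list(range(index, index + count))))
        index += count
    return groups
```

Each resulting index list can then be handed to a PDF writing library to produce one output file per group.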
1
vote
1 answer

How to complete for loop with pdfplumber?

Problem: I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16 when the code returned this error. Goal: I need to scrape a pdf that looks like this (I wanted to attach the pdf…
Edo Grm
  • 13
  • 4
1
vote
2 answers

How to stop pdfplumber from reading the header of every page?

I want pdfplumber to extract the text from a random pdf given by the user. The problem is that pdfplumber also extracts the header text or the title from each page. How can I program pdfplumber to not read the page headers (titles) and the page…
Anandakrishnan
  • 349
  • 5
  • 10
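
A rough sketch of one option: filter the word dicts by vertical position and drop anything inside a fixed band at the top of the page. It assumes word dicts shaped like the output of `page.extract_words()` (keys `text`, `top`, `bottom`, measured from the top of the page). The band heights are made-up values that would need tuning per document, and the optional bottom band is an assumption, not something the excerpt confirms. The helper name `drop_header_footer` is hypothetical.

```python
def drop_header_footer(words, page_height, header_height=50, footer_height=0):
    """Return body text, skipping words in the top (and optional bottom) band."""
    body = [
        word
        for word in words
        if word["top"] >= header_height
        and word["bottom"] <= page_height - footer_height
    ]
    return " ".join(word["text"] for word in body)
```

With a real page this would be `drop_header_footer(page.extract_words(), page.height)`. Cropping the page with `page.crop(...)` before extracting is another route worth trying.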
1
vote
2 answers

How to remove space between English Words after extracting from pdfplumber

I have extracted text from a pdf (using pdfplumber) to txt, but there are some spaces between words that are not in the PDF file. I have tried to use nltk to find words using a "Previous_word" + "current_word" combination and checking if they exist in…
Joy
  • 145
  • 2
  • 9
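
The "previous word + current word" idea can be sketched as follows, assuming a `vocabulary` set of known lowercase words (hypothetical here; it could be built from any word list). Two neighbouring tokens are merged only when the combined form is a known word and at least one of the pieces is not, which avoids gluing two valid words together.

```python
def merge_split_words(tokens, vocabulary):
    """Merge adjacent tokens that look like one word split by a stray space."""
    merged = []
    for token in tokens:
        if merged:
            previous = merged[-1]
            combined_known = (previous + token).lower() in vocabulary
            both_known = previous.lower() in vocabulary and token.lower() in vocabulary
            if combined_known and not both_known:
                merged[-1] = previous + token
                continue
        merged.append(token)
    return merged
```

This only repairs words split into two pieces where the first two fragments already form a known word, so a word broken into three or more fragments may stay split.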