Highest Voted 'pdfplumber' Questions

1

vote

1 answer

Is there a way to extract sentences after bold text in Python?

I have extracted some bold text from a PDF in Python, which works fine. But I also want to extract the sentence, or more than one sentence, after the bold text, e.g. "Blue sky is what we see when we look up." I can extract the blue sky part. But I'm…

python pdfplumber

asked Aug 31 '22 at 18:51

Ben

11
1

1

vote

1 answer

How to filter text within a certain area using pdfplumber and OpenCV?

I've got a bunch of PDF files from conference proceedings. Every PDF file's structure looks like: Title with bold, large size font Author1 Author2 AuthorN Affiliation1 …

python pdf pdfplumber

asked Aug 18 '22 at 08:41

Valuex

104
1
10

1

vote

2 answers

extracting images from PDF with page and screen coordinate information

I want to extract images from PDFs retaining a knowledge of their content (page_number and coordinates on page). (Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp). I can do this with PDFBox (Java) but I'd…

image pdf pdfminer pdfplumber

asked Jul 11 '22 at 10:00

peter.murray.rust

37,407
44
153
217

1

vote

0 answers

How to extract tables from PDFs while pulling in non-table text section identifiers

I'm working through extracting tables using pdfplumber in Python from a PDF that has mostly-consistent structure between pages. My goal is to extract each of the 2 tables under each section header (white font highlighted blue) on each page. See…

python pdf-extraction pdfplumber

asked Jan 28 '22 at 22:05

WinstonDoodle

23
3

1

vote

0 answers

Replace (cid:) with chars using Python regex findall where data is extracted from PDF using pdfplumber / pdfminer

Following on from Replace (cid:) with chars using Python when extracting text from PDF files (I can't add a comment there), I attempted to convert the following with @josefz's script but get unrecognisable strings that are not in the original PDF.…

python pdfminer pdfplumber

asked Jan 27 '22 at 07:30

DaveC

21
3

1

vote

1 answer

ocrmypdf - could not find source-pdf?

I would like to use ocrmypdf to convert a PDF file from a picture to a readable PDF. I tried it with the following simple code: (the invoice.pdf is of course available in the same path as the Python script and the output.pdf should be…

python pdf ocr pdfplumber ocrmypdf

asked Jan 14 '22 at 22:37

Rapid1898

895
1
10
32

1

vote

0 answers

How to extract data from messy PDF file with no standard formatting?

I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula or PyPDF2 to extract tables out of it, but the data in the PDF is not stored in tables. So, I chose pdfplumber to extract text out of it. So far, I am able…

python dataframe pdf pdf-scraping pdfplumber

asked Dec 14 '21 at 12:33

Aamir Khan Maarofi

157
2
13

1

vote

1 answer

pdfplumber | Extract text from dynamic column layouts

Attempted solution at bottom of post. I have near-working code that extracts the sentence containing a phrase, across multiple lines. However, some pages have columns, so the respective outputs are incorrect: separate texts are wrongly merged…

python if-statement text-extraction information-extraction pdfplumber

asked Nov 30 '21 at 13:56

StressedBoi69420

1,376
1
12
40

1

vote

1 answer

Scraping a sentence across many lines | Recursive error unresolved

Goal: if a PDF line contains a sub-string, then copy the entire sentence (across multiple lines). I am able to print() the line the phrase appears in. Now, once I find this line, I want to iterate backwards until I find a sentence terminator (. ! ?), from…

python recursion pypdf pdfplumber

asked Nov 29 '21 at 14:36

StressedBoi69420

1,376
1
12
40
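For the sentence-across-lines goal described in the question above, one non-recursive option is to join the extracted lines into a single string and scan outward from the match to the nearest sentence terminators. The sketch below is only an illustration, not the asker's or the accepted answer's code: the function name `sentence_containing` and the sample lines are hypothetical, and it assumes the lines are already available as plain strings (for example, split from pdfplumber's `extract_text()` output).

```python
def sentence_containing(lines, phrase, terminators=".!?"):
    """Return the full sentence that contains `phrase`, or None if absent.

    `lines` is a list of text lines; the sentence may span several of them.
    """
    # Join the lines so a sentence broken across lines becomes contiguous.
    text = " ".join(line.strip() for line in lines)
    pos = text.find(phrase)
    if pos == -1:
        return None

    # Look backwards for the previous terminator (or the start of the text).
    start = max(text.rfind(t, 0, pos) for t in terminators) + 1

    # Look forwards for the next terminator (or the end of the text).
    hits = (text.find(t, pos + len(phrase)) for t in terminators)
    ends = [i for i in hits if i != -1]
    end = min(ends) + 1 if ends else len(text)

    return text[start:end].strip()


# Hypothetical sample lines, standing in for text extracted from a PDF page.
sample = ["The sky was clear. Blue sky is what", "we see when we look", "up. Another sentence!"]
print(sentence_containing(sample, "we see"))
# prints: Blue sky is what we see when we look up.
```

This treats every `.`, `!` or `?` as a sentence boundary, so abbreviations such as "e.g." would cut a sentence short; a real document may need a more careful splitter.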

1

vote

1 answer

Python - inserting header into a csv

I'm developing a script that extracts text from all pdf files in a directory via a loop and inserts them into individual cells of a csv file. I can successfully write the output into the cells. However, I need the csv file to contain the header…

python csv pdf pdfplumber

asked Nov 11 '21 at 15:22

Daniel Hutchinson

155
14

1

vote

1 answer

Python & Pandas: combining multiple rows into single cell

I'm writing a script that extracts text from a PDF file and inserts it as a string into a single csv row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the csv as an individual row. However, I'm struggling…

python pandas csv pdfplumber

asked Nov 10 '21 at 13:09

Daniel Hutchinson

155
14

1

vote

0 answers

Separating large PDF document into smaller documents based on content

I have a large pdf file with very specific formatting, a bunch of reports if you will, all in one big pdf document. I'm using pdfplumber to extract specific text within a bounding box on each page. I've called this variable scene_text. The value of…

python pdf pypdf pdfplumber

asked Oct 29 '21 at 16:59

John

11
1

1

vote

1 answer

How to complete for loop with pdfplumber?

Problem: I was following this tutorial https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16 when the code returned this error. Goal: I need to scrape a PDF that looks like this (I wanted to attach the pdf…

python regex pdf pdfplumber

asked Sep 25 '21 at 18:15

Edo Grm

13
4

1

vote

2 answers

How to stop pdfplumber from reading the header of every page?

I want pdfplumber to extract the text from a random PDF given by the user. The problem is that pdfplumber also extracts the header text or the title from each page. How can I program pdfplumber to not read the page headers (titles) and the page…

python python-3.x pdfplumber

asked Apr 01 '21 at 07:58

Anandakrishnan

349
5
10
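One common way to approach the header question above is to crop each page to a body region before extracting text. The sketch below is a hedged illustration rather than the answer given on the question: the helper name `body_text` and the 50-point margins are assumptions, and real documents would need margins measured from their own layout. It relies only on the `bbox`, `crop()` and `extract_text()` members that pdfplumber page objects provide.

```python
def body_text(page, header_height=50, footer_height=50):
    """Extract text from `page` while skipping a top and a bottom margin.

    The margin sizes (in PDF points) are illustrative assumptions.
    """
    # page.bbox is (x0, top, x1, bottom), measured from the top-left corner.
    x0, top, x1, bottom = page.bbox
    body_box = (x0, top + header_height, x1, bottom - footer_height)
    # crop() returns a view of the page restricted to the bounding box.
    return page.crop(body_box).extract_text()


# Usage with a real file (hypothetical file name), left commented out
# so the helper above stays importable without a PDF at hand:
#
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     text = "\n".join(body_text(p) or "" for p in pdf.pages)
```

Fixed margins only work when headers sit at a consistent height on every page; for a PDF with an unknown layout, the margins would have to be detected rather than hard-coded.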

1

vote

2 answers

How to remove space between English Words after extracting from pdfplumber

I have extracted text from a PDF (using pdfplumber) to txt, but there are some spaces between words that are not in the PDF file. I have tried using nltk to find words using the "Previous_word" + "current_word" combination and checking if they exist in…

python pdf pdfplumber

asked Mar 15 '21 at 13:04

Joy

145
2
9
