Questions tagged [pdfplumber]

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

95 questions
0
votes
1 answer

How to extract radio button / checkbox information with Python from a PDF file?

I would like to get the radio button / checkbox information from a PDF document. I had a look at pdfplumber and PyPDF2, but was not able to find a solution with these modules. I can parse the text using this code, but for the radio buttons I get…
Rapid1898
0
votes
1 answer

How do you get the filename from a `pdfplumber.pdf.PDF`?

I have a function that is passed a pdfplumber.pdf.PDF argument and I need to reference the filename of the PDF. Is there any way to get the filename from a pdfplumber.pdf.PDF class instance?
Keegan Skeate
0
votes
1 answer

pdfplumber to_image() OSError: exception: access violation writing 0x0000000000000008 in Windows 10

I was trying to use the pdfplumber library in Python (ver. 3.10.6) to convert some PDF pages to images, but the pdfplumber to_image() method throws the following error: import pdfplumber >>> myDOc = pdfplumber.open("CV.pdf") >>> myImg =…
Kuba Jjj
0
votes
0 answers

Extract only the body text of the PDF, not the bulleted points, headings and subheadings using python pdfplumber library

Code import pdfplumber ecdata = "" with pdfplumber.open("XYZ Transcript.pdf") as pdf: for i in range(len(pdf.pages)): print("Page No.: ", i+1) page_obj = pdf.pages[i] page = page_obj.within_bbox((70, 50, page_obj.width,…
0
votes
1 answer

When running pdfplumber in python I got an error --> CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team

I'm using a Python script that extracts the text content of a PDF file using pdfplumber. When running pdfplumber in Python I got an error like this: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore,…
0
votes
2 answers

Extract specific text from PDFs using Python

I have tried different Python libraries to extract specific text from PDFs. I have to extract the text under the heading pdf1 from this PDF, starting from Case 1 to diamond ◆ bold. The next PDF contains the data in a…
0
votes
1 answer

How to do complex PDF extraction with regex

I have a PDF file which contains lottery ticket winners. I want to extract all winning tickets according to their prizes. PDF file. I tried this: import re import pdfplumber prize_re = re.compile(r"^\d[a-z]") cons_prize_re =…
Chams Agouni
0
votes
1 answer

Pdfplumber - Extract a table in pdf without any borders

I am trying to extract the table shown in the image here into a data frame. I tried using tabula-py to extract it, but read_pdf returned []. Not sure if tabula-py is the right module to use. Can anyone help?
0
votes
1 answer

Mapping highlighted text in a PDF document to a character index range in its .txt output

I have a project where I have to highlight text in a structured PDF document and classify it so I can perform regex on multiple substrings and give their respective variables the proper values. Is there a way to have a PDF prompted to the screen…
PeterQuando
0
votes
1 answer

How to take multiple pages as input in pdfplumber?

I am using pdfplumber to take input from a PDF file. My question is how I can take pages 1-7 as input using pdfplumber. I'm using this code: filename = "1st Year 1stSemester.pdf" pdf = pdfplumber.open(filename) totalpages = len(pdf.pages) p0 =…
NobinPegasus
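One possible way to approach the page-range question above is to slice the page list and join the text of each selected page. This is only a sketch: `text_from_page_range` and `FakePage` are hypothetical names introduced here, and `FakePage` merely stands in for a real page object so the example runs without a PDF. The only page method it assumes is `extract_text()`.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page so the sketch runs without a file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def text_from_page_range(pages, first, last):
    """Return the text of pages first..last (1-based, inclusive), joined by newlines."""
    # List slices are 0-based and end-exclusive, hence first - 1 and last.
    selected = pages[first - 1:last]
    # Treat a page that yields no text as an empty string.
    return "\n".join(page.extract_text() or "" for page in selected)


# Ten made-up pages; take pages 1 through 7.
pages = [FakePage(f"page {n}") for n in range(1, 11)]
combined = text_from_page_range(pages, 1, 7)
print(combined.splitlines())
```

With a real file, the same helper could be given `pdf.pages` from an opened pdfplumber document in place of the stand-in list.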
0
votes
2 answers

pdfplumber memory hogging (crash with large pdf files)

Using pdfplumber to extract text from large pdf files crashes it. with pdfplumber.open("data/my.pdf") as pdf: for page in pdf.pages: **do something**
Filipe Lemos
0
votes
1 answer

Pdfplumber misses first column and last row for all tables within a schematic

I am new to pdfplumber, and I am amazed at how it extracts text from tables. It's easy to work with for full-page tables, but in my case I am using some topological schematics with some tables inside. It fails to extract the first column and…
Pablo
0
votes
1 answer

Can pdfplumber extract tables for my scanned pdfs?

(I know that pdfplumber is mainly geared towards computer-generated PDFs. However, before I spend a couple of days hand-typing data from my scanned PDFs, I thought I'd ask if pdfplumber could somehow help me.) My problem: I have scanned PDFs from…
0
votes
1 answer

How to count the number of words from a list from a text extract in a pdf using Python?

I am trying to count a series of words extracted from a PDF, but I only get 0, which is not correct. total_number_of_keywords = 0 pdf_file = "CapitalCorp.pdf" tables=[] words = ['blank','warrant ','offering','combination ','SPAC','founders'] count={} #…
Math4264
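A minimal sketch of keyword counting over already-extracted text, for illustration only. The helper `count_keywords` and the sample sentence are hypothetical; in practice the text would come from something like pdfplumber's `page.extract_text()`. The sketch assumes that whole-word, case-insensitive matching is wanted and that stray whitespace in the keyword list should be ignored.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword in text."""
    # Normalise the keywords: strip stray whitespace and lowercase them.
    wanted = {kw.strip().lower() for kw in keywords}
    # Split the text into lowercase word tokens.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(tok for tok in tokens if tok in wanted)
    # Report every keyword, including those that never occur.
    return {kw: counts.get(kw, 0) for kw in sorted(wanted)}


# Made-up sample text standing in for text extracted from a PDF.
sample_text = "The SPAC offering includes one warrant. Founders hold each warrant."
keywords = ["blank", "warrant ", "offering", "combination ", "SPAC", "founders"]
result = count_keywords(sample_text, keywords)
print(result)
```

Whole-word matching means "warrants" would not count towards "warrant"; a substring or stemming approach would be a different design choice.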
0
votes
1 answer

pdfplumber extract_text function also extracts text from the table. Only want to extract text outside of the table

I have a PDF that contains text and tables. I want to extract both of them, but when I use the extract_text function it also extracts the content inside the table. I only want to extract the text which is outside the table and the…
Deepam