0

I have tried several Python libraries to extract specific text from PDFs. In the first PDF (pdf1), I need the text under a heading: everything from "Case 1" up to the bold diamond (◆).

The next PDF (pdf2) has a different format. Here I need the text from "History" to "Examination", and then from "Examination" to "Investigations", written to an Excel file with History and Investigation as columns and the corresponding text in rows. A fixed set of Python regexes does not work well for this, because every PDF has a different format and we need different text from each one.

Apart from these two, I have 5+ other types of PDFs to process. I have tried pdfminer, pdfplumber, PyMuPDF, pytesseract, textract, and GROBID.

sample pdf:sample pdfs

code 1

import pdfplumber
import docx

file='Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'

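# Open the PDF in a context manager so the file is closed afterwards
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        # Plain text of the whole page
        text = page.extract_text()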

code 2


import docx  # python-docx
import fitz  # PyMuPDF

file = 'Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'

docum = docx.Document()
with fitz.open(file) as doc:
    for page in doc:
        # Plain text of the whole page
        text = page.get_text()

Both snippets above extract the text of the whole page, but I want only specific text. I know regex could do this, but I have many different PDF formats and it has become difficult to maintain regexes for all of them.
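
To make the target for pdf2 concrete, here is a minimal sketch of the result I am after, assuming the page text has already been extracted into a string. The function name `split_sections` and the sample text are made up for illustration; the heading list would have to be supplied separately for each PDF format.

```python
def split_sections(text, headings):
    """Split text at each heading, searched in order and case-insensitively.

    Returns a dict mapping each heading that was found to the text between
    it and the next found heading (or the end of the text).
    Assumes lower-casing does not change the string length (true for ASCII).
    """
    lower = text.lower()
    positions = []
    cursor = 0
    for heading in headings:
        idx = lower.find(heading.lower(), cursor)
        if idx == -1:
            continue  # heading missing in this document, skip it
        positions.append((heading, idx))
        cursor = idx + len(heading)

    sections = {}
    for n, (heading, idx) in enumerate(positions):
        end = positions[n + 1][1] if n + 1 < len(positions) else len(text)
        # Drop the heading itself plus any colon or whitespace around the body
        sections[heading] = text[idx + len(heading):end].strip(" :\n")
    return sections


# Made-up sample text, not taken from the real PDFs
sample = (
    "History: fever for 3 days\n"
    "Examination: febrile, alert\n"
    "Investigations: CBC normal"
)
row = split_sections(sample, ["History", "Examination", "Investigations"])
print(row)
```

Each returned dict would become one row, with the headings as columns; something like `pandas.DataFrame([row]).to_excel("out.xlsx", index=False)` could then write the Excel file.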

Martin Thoma
  • Show the code you've written and what was the issue with that. – Martin Thoma Jun 30 '22 at 04:54
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Jul 01 '22 at 09:12

2 Answers

1

GROBID is not made for parsing such large PDF documents; it is designed to understand scholarly publications.

That said, there is a Python client that can be useful: https://github.com/kermitt2/grobid-client-python. You can use the Hugging Face Space demo server, https://kermitt2-grobid.hf.space/, and parse the output XML with https://pypi.org/project/grobid-tei-xml/.

Simple example:


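import grobid_tei_xml
from grobid_client.grobid_client import GrobidClient

# Assumes a GROBID server is reachable at this URL; input_path is a placeholder.
grobid_client = GrobidClient(grobid_server="https://kermitt2-grobid.hf.space")
input_path = "paper.pdf"

# Depending on the client version, process_pdf may require further arguments.
pdf_file, status, text = grobid_client.process_pdf("processFulltextDocument", input_path)

if status == 200:
    doc = grobid_tei_xml.parse_document_xml(text)
    print(doc.abstract)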

Luca Foppiano
0

Using the PyMuPDF library:

  1. Get the coordinates of the text blocks on the page with Page.get_text('dict').
  2. From those blocks, take the bounding box of the text you need; call it rect.
  3. Extract the text with Page.get_text(clip=rect, sort=False), where rect is the rectangle enclosing the text you want.
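
The three steps above could look roughly like the sketch below. The helper names (`block_text`, `find_clip_rect`, `extract_between`) and the marker strings are hypothetical; the sketch assumes both markers sit on the same page and that each marker appears inside a single text block, which will not hold for every PDF layout.

```python
def block_text(block):
    """Join the text of all spans in one block of Page.get_text('dict')."""
    return " ".join(
        span["text"] for line in block.get("lines", []) for span in line["spans"]
    )


def find_clip_rect(page_dict, start_marker, end_marker):
    """Return (x0, y0, x1, y1) covering the text blocks from the one that
    contains start_marker up to, but not including, the one that contains
    end_marker. Returns None if start_marker is never found."""
    inside = False
    boxes = []
    for block in page_dict["blocks"]:
        if block.get("type") != 0:  # type 0 = text block; skip images
            continue
        text = block_text(block)
        if not inside and start_marker in text:
            inside = True
        elif inside and end_marker in text:
            break
        if inside:
            boxes.append(block["bbox"])
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def extract_between(pdf_path, page_number, start_marker, end_marker):
    """Steps 1-3 on one page of a real PDF (requires PyMuPDF)."""
    import fitz  # imported here so the helpers above work without PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        rect = find_clip_rect(page.get_text("dict"), start_marker, end_marker)
        if rect is None:
            return ""
        return page.get_text("text", clip=fitz.Rect(rect), sort=False)
```

For the first PDF in the question this might be called as `extract_between(file, 0, "Case 1", "◆")`, with the page number and markers adjusted per document type.
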
Mohit Mehlawat