I have tried several Python libraries to extract specific text from PDFs. From this PDF (pdf1), I need to extract the text under a heading: everything starting at "Case 1" up to the bold diamond ◆.
The next PDF (pdf2) contains the data in a different format. In this PDF I need to extract the text from History to Examination, and then from Examination to Investigations, and write it to an Excel file with History and Investigation as columns and the corresponding text in rows. Python regex does not work well for this, because every PDF has a different format and we want a different kind of text from each one.
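To make the pdf2 target concrete, here is a rough sketch of the desired output, not a working solution. It assumes each heading sits on its own line in the extracted text, which may not hold for every layout. The function name `split_sections` and the sample text are made up for illustration; the heading list would have to be supplied per PDF format.

```python
def split_sections(text, headings):
    """Map each heading to the lines that follow it, up to the next heading."""
    wanted = {h.lower(): h for h in headings}
    sections = {h: [] for h in headings}
    current = None
    for line in text.splitlines():
        # A heading line is matched case-insensitively, ignoring a trailing colon.
        key = line.strip().rstrip(":").lower()
        if key in wanted:
            current = wanted[key]
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {h: " ".join(body) for h, body in sections.items()}


# Invented sample text standing in for page.extract_text() output.
sample = (
    "Case notes\n"
    "History\n"
    "Fever for 3 days.\n"
    "Poor feeding.\n"
    "Examination:\n"
    "Temp 39C.\n"
    "Investigations\n"
    "CBC normal."
)
row = split_sections(sample, ["History", "Examination", "Investigations"])
```

Each returned dict would be one row, with the headings as columns; a list of such rows could then be written out with something like pandas `DataFrame(rows).to_excel(...)`.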
Apart from these two, I have 5+ other types of PDFs to process. I have tried libraries such as pdfminer, pdfplumber, PyMuPDF, pytesseract, textract, and GROBID.
sample pdf:sample pdfs
code 1
import pdfplumber

file = 'Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'
# Extract the full text of every page
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
code 2
import docx
import fitz  # PyMuPDF

file = 'Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'
docum = docx.Document()
# Extract the full text of every page
with fitz.open(file) as doc:
    for page in doc:
        text = page.get_text()
Both snippets above extract the text of the whole page, but I want only specific text. I know regex could do this, but I have many different PDF formats, and it has become difficult to maintain regex patterns for all of them.
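For the pdf1 case (from "Case 1" to the bold ◆), one direction that avoids regex is to use the font information PyMuPDF exposes through `page.get_text("dict")`, where each span carries a `flags` value and bit 4 (value 16) marks bold text. The following is only a sketch under assumptions: the helper names `iter_spans` and `text_between` are hypothetical, the ◆ glyph is assumed to come out of the PDF as that exact character, and spans are assumed to arrive in a usable order.

```python
BOLD = 16  # PyMuPDF sets bit 4 of a span's "flags" for bold text


def iter_spans(pdf_path):
    """Yield PyMuPDF span dicts (each has "text" and "flags") in document order."""
    import fitz  # PyMuPDF; imported lazily so text_between works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no "lines"
                    yield from line["spans"]


def text_between(spans, start_marker, end_symbol="◆"):
    """Collect text from the span containing start_marker up to a bold end_symbol."""
    collected, inside = [], False
    for span in spans:
        text = span["text"]
        if not inside:
            if start_marker not in text:
                continue
            inside = True
            text = text[text.index(start_marker):]
        elif end_symbol in text and span["flags"] & BOLD:
            break  # reached the bold diamond that closes the case
        collected.append(text)
    return " ".join(part.strip() for part in collected if part.strip())


# Hand-made spans standing in for iter_spans(file) output.
fake_spans = [
    {"text": "Intro", "flags": 0},
    {"text": "Case 1: A 3-year-old", "flags": 20},
    {"text": "presents with fever.", "flags": 4},
    {"text": "◆ Key point", "flags": 16},
    {"text": "after", "flags": 0},
]
case_text = text_between(fake_spans, "Case 1")
```

With a real file this would be called as `text_between(iter_spans(file), "Case 1")`. The start marker and end symbol would still need to be chosen per PDF type, but as plain configuration values rather than one regex per format.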