I've used 'fitz' from the PyMuPDF module to extract the data, and then pandas to convert the extracted data to a dataframe.
# Code to read multiple PDFs from the folder:
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]
# Code to extract the data:
import fitz  # PyMuPDF
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()
However, the above code only extracts the data for the last PDF in the folder, and so gives the result for that PDF alone. The goal is to extract the data from all the PDFs in the folder, one by one.
Please help me understand why this is happening and how to resolve it.
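For reference, here is a minimal sketch of the behavior I am after. In the loop above, pypdf_text is reset to "" for every file and never stored anywhere, so once the loop ends it holds only the last PDF's text. The sketch keeps one entry per file in a dictionary instead. The function names extract_with_fitz and extract_all are hypothetical, and page.get_text() assumes a recent PyMuPDF release (older releases use page.getText(), as in my code).

```python
def extract_with_fitz(pdf_path):
    """Return the text of one PDF, joined across its pages (needs PyMuPDF)."""
    import fitz  # imported lazily so the rest of the sketch runs without it

    with fitz.open(pdf_path) as doc:
        # get_text() is the newer method name; older releases use getText()
        return "".join(page.get_text() for page in doc)


def extract_all(pdf_files, extract_one=extract_with_fitz):
    """Map each PDF path to its own text instead of overwriting one variable."""
    texts = {}
    for pdf in pdf_files:
        # store the result per file so earlier PDFs are not lost
        texts[pdf] = extract_one(pdf)
    return texts
```

With this, texts = extract_all(pdf_files) would hold one entry per PDF, and something like pd.DataFrame(list(texts.items()), columns=["file", "text"]) should give one dataframe row per file (the column names are only an illustration).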