I'm trying to read PDFs one by one and then convert them into a dataframe

Question

I've used `fitz` from the PyMuPDF module to extract the data, and then pandas to convert the extracted data to a dataframe.

#Code to read multiple pdfs from the folder:

from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]

#Code to extract the data:

import fitz  # PyMuPDF

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()

However, the above code only extracts the data for the last PDF in the folder, so I only get the result for that one PDF. The goal is to extract the data from all the PDFs in the folder, one by one.

Please help me understand why this is happening and how to resolve it.

Define `pypdf_text` before `for pdf in pdf_files` loop. You rewrite it with empty string each time so lose text from previous pdf file. — Yevhen Kuzmovych, Jan 25 '22 at 13:53
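
A minimal sketch of the fix described in the comment above. The names `collect_texts` and `read_pdf_text` are hypothetical and not from the question. Instead of one growing string it keeps one entry per file, which is a variation on the comment's suggestion rather than what it literally proposes. The container is created once, before the loop, so nothing is overwritten. `get_text()` is the current PyMuPDF spelling of the `getText()` used in the question.

```python
from typing import Callable, Dict, Iterable


def collect_texts(paths: Iterable[str], read_text: Callable[[str], str]) -> Dict[str, str]:
    """Return {path: text}, keeping the text of every file."""
    texts = {}  # created once, BEFORE the loop, so earlier files are kept
    for path in paths:
        texts[path] = read_text(path)
    return texts


def read_pdf_text(path: str) -> str:
    """Hypothetical reader built on PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF; assumed to be installed

    with fitz.open(path) as doc:
        # Join the text of all pages of this one document.
        return "".join(page.get_text() for page in doc)
```

With the `pdf_files` list from the question, this would be called as `texts = collect_texts(pdf_files, read_pdf_text)`.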

score 0 · Answer 1 · edited Jan 26 '22 at 09:20

Change this code:

Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")

to

files_pdf = [file for file in glob.glob(path + r"\*.pdf", recursive=True)]

and pass the folder path in as the variable `path` (this needs `import glob`).
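
For reference, a small sketch of how `glob.glob` behaves with `recursive=True`. The helper name `find_pdfs` is hypothetical. Subfolders are only searched when the pattern contains `**`; with a plain `*.pdf` pattern the flag has no effect. `os.path.join` is used here so the separator does not have to be written by hand.

```python
import glob
import os


def find_pdfs(path: str, recursive: bool = False) -> list:
    """Return sorted absolute paths of the .pdf files under `path`."""
    if recursive:
        # "**" matches zero or more directories, so files directly in
        # `path` are included as well as those in subfolders.
        pattern = os.path.join(path, "**", "*.pdf")
    else:
        pattern = os.path.join(path, "*.pdf")
    return sorted(os.path.abspath(p) for p in glob.glob(pattern, recursive=recursive))
```

Either way of listing the files yields a list of paths to loop over; the choice of `pathlib` or `glob` does not change how the loop that reads them behaves.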

answered Jan 26 '22 at 04:17 by Sai Goutam · edited Jan 26 '22 at 09:20 by Pini Cheyni

score 0 · Answer 2 · answered Jan 26 '22 at 05:55

The following code worked for me:

from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
pdf_files = [str(file.absolute()) for file in pdf_search]

#Code to extract the data:

import fitz  # PyMuPDF

pdf_txt = ""  # defined once, before the loop, so text from every PDF accumulates
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            pdf_txt += page.getText()

#Converting the extracted data to data frame:

import pandas as pd

with open('pdf_txt.txt', 'w', encoding='utf-8') as f:  # write the text to a file
    f.write(pdf_txt)

data = pd.read_table('pdf_txt.txt', sep='\n')  # read the text file into a dataframe
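
As a possible alternative to going through a text file, here is a rough sketch that turns extracted text into one record per line. It assumes the text is held in a dict mapping file name to text, which is not how the code above stores it, and `texts_to_rows` is a hypothetical helper name. The point is only to show that the file of origin can be kept alongside each line.

```python
from typing import Dict, List


def texts_to_rows(texts: Dict[str, str]) -> List[dict]:
    """Flatten {file: text} into one record per non-empty line."""
    rows = []
    for file_name, text in texts.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if line.strip():  # skip blank lines
                rows.append({"file": file_name, "line_no": line_no, "text": line})
    return rows


# With pandas installed, the records can be passed to the constructor:
#   import pandas as pd
#   data = pd.DataFrame(texts_to_rows(texts))
```

Each record becomes one dataframe row with `file`, `line_no` and `text` columns, so rows from different PDFs stay distinguishable.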

Thank you @Yevhen Kuzmovych for your help!
