
I have PDFs distributed over several folders and subfolders. I've been trying to write a short Python script that searches each PDF for any term I enter. As not all PDFs are searchable, I also tried to build a list of searchable and non-searchable PDFs, with the idea of bringing everything in line.

The program seems to work, up to a point: the longer it runs, the slower it gets.

At a certain moment it just stops. I think it is a memory issue, but I can't seem to find a solution.
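One way to test the memory theory might be to parse each PDF in a short-lived worker process, so that whatever PyPDF2 keeps alive is released when the worker exits. A minimal sketch of that idea (check_pdf and the pool settings are illustrative, not part of my script below; it assumes the same legacy PyPDF2 API):

import os
from multiprocessing import Pool

from PyPDF2 import PdfFileReader


def check_pdf(args):
    # runs in a worker process; returns (path, found_or_not)
    path, search_word = args
    try:
        with open(path, 'rb') as f:
            pdf = PdfFileReader(f)
            text = pdf.getPage(0).extractText()
        return path, search_word in text
    except Exception:
        # unreadable or image-only PDF: treat as "not found"
        return path, False


if __name__ == '__main__':
    search_word = input("enter a word you want to search in file: ")
    pdf_paths = [os.path.join(subdir, name)
                 for subdir, dirs, files in os.walk(os.getcwd())
                 for name in files if name.endswith(".pdf")]
    # maxtasksperchild=1 recycles each worker after one file, so memory
    # cannot accumulate across PDFs
    with Pool(processes=2, maxtasksperchild=1) as pool:
        args = ((p, search_word) for p in pdf_paths)
        for path, found in pool.imap(check_pdf, args):
            if found:
                print(path)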

The script I have so far:

import os
# extracting_text.py
from PyPDF2 import PdfFileReader

search_word = input("enter a word you want to search in file: ")
counter = 0
noTextCounter = 0
SolutionCounter = 0

with open("Solutions.txt", "w") as text_file:
    text_file.writelines(f"List of files that contain: {search_word}")


def text_extractor(path):
    with open(path, 'rb') as f:
        # counters to spot PDFs that only contain an image; if countEmpty
        # is activated it has to be included in the return
        countEmpty = 0
        countSolution = 0

        pdf = PdfFileReader(f)
        # get the first page and extract its text
        page = pdf.getPage(0)
        text = page.extractText()

        if text == '':
            print('No text')
            countEmpty = countEmpty + 1
        else:
            if search_word in text:
                print("word found")
                countSolution = countSolution + 1
            else:
                print("word not found")
        # selection of potential returns
        # return countEmpty
        return countSolution


root = os.getcwd()

try:
    for subdir, dirs, files in os.walk(root):
        for file in files:
            filepath = subdir + os.sep + file
            if filepath.endswith(".pdf"):
                print(filepath)
                counter = counter + 1
                print(counter)
                if __name__ == '__main__':
                    path = filepath
                    indicator = text_extractor(path)
                    # noTextCounter = noTextCounter + indicator
                    SolutionCounter = SolutionCounter + indicator
                    print("indicator: " + str(indicator))
                    if indicator == 1:
                        with open("Solutions.txt", "a") as text_file:
                            text_file.writelines('\n' + path)
                    # option: write two lists, one of image-only PDFs
                    # and one of PDFs that contain text
                    # with open("ListOfImagePdfs.txt", "a") as text_file:
                    #     text_file.writelines('\n' + path)
                    # else:
                    #     with open("ListOfDataPdfs.txt", "a") as text_file:
                    #         text_file.writelines('\n' + path)
                    # print("amount of image pdf's: " + str(noTextCounter))
except:
    pass
# trycatch to be added
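On reflection, the bare except: pass may also explain the hard stop: the try wraps the whole os.walk loop, so the first exception that escapes (for example a corrupt or encrypted PDF) silently ends the entire walk. A sketch of per-file handling instead, using the same loop structure as above (the message format is illustrative):

for subdir, dirs, files in os.walk(root):
    for file in files:
        if not file.endswith(".pdf"):
            continue
        filepath = os.path.join(subdir, file)
        try:
            indicator = text_extractor(filepath)
        except Exception as exc:
            # log the failing file and keep walking instead of stopping
            print(f"skipped {filepath}: {exc}")
            continue
        if indicator == 1:
            with open("Solutions.txt", "a") as text_file:
                text_file.write('\n' + filepath)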
  • Good question; I would love to know its answer too. – ash Feb 18 '22 at 10:15
  • Can you reformat the code? Does it get slower with every pdf file read? How many files are you trying to check? – mx0 Feb 18 '22 at 10:22
  • The total number of files to search is around 3500, distributed over about 20 folders/subfolders. It starts slowing down after about 20 checks, then again at about 150, and stops at 599. – OldGrey Feb 18 '22 at 11:16
  • This would be quite easy if you were searching a bunch of MS Word files using VBA, rather than PDF files. – ASH Apr 05 '22 at 19:43

0 Answers