How to iterate through PDF files and find the occurrences of the same list of specific words in each file?

Question

I need help finding a list of specific words in many PDF files using Python. For example, I want to count the occurrences of the words "design" and "process" in two PDF files.

The following is my code:

output = []
count = 0
for fp in os.listdir(path):

    pdfFileObj = open(os.path.join(path, fp), 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    
    for i in range(number_of_pages):
        page = reader.pages[i]

        output.append(page.extract_text())
        text = str(output)
       
    
    words = ['design','process']
    count = {}
    for elem in words:
        count[elem] = 0
            
    # Count occurrences
    for i, el in enumerate(words):
        count[f'{words[i]}'] = text.count(el)
    
    print(count)

The code output is: {'design': 112, 'process': 31} {'design': 195, 'process': 56}

The first count is right: the first PDF file does contain 112 occurrences of "design" and 31 of "process". However, the second count is wrong. The second PDF contains 83 occurrences of "design" and 25 of "process", but the output values are much larger.

My expected output is: {'design': 112, 'process': 31} {'design': 83, 'process': 25}

I noticed that subtracting the first count from the second (195 - 112 = 83, 56 - 31 = 25) gives the correct values. I don't know how to fix the code. Could someone please help me? Thank you so much.

Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Сергей Кох, Mar 04 '23 at 19:38
keep in mind that text extraction may not be reliable, use for ex a regex to validate the result — cards, Mar 04 '23 at 19:57
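
One way to read the regex suggestion above is to count whole words instead of raw substrings, since str.count("design") also matches inside "redesign" or "designer". The following is only a sketch under that assumption; count_whole_words is a hypothetical helper name, and ignoring case is a choice made here, not something the question asked for.

```python
import re


def count_whole_words(text, words):
    """Count case-insensitive whole-word matches of each word in text."""
    counts = {}
    for word in words:
        # \b marks a word boundary, so "design" does not match "redesign".
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


sample = "Design the process; redesign the processes."
print(count_whole_words(sample, ["design", "process"]))
# {'design': 1, 'process': 1}
```

Comparing these numbers with the plain substring counts can show how much of a total comes from longer words that merely contain the search term.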

score 1 · Accepted Answer · answered Mar 04 '23 at 19:02

You never reset the list output when you advance to the next file, so the text of every earlier file is still in it when you count. As you point out, the second set of numbers is the expected counts plus the counts from the first file.
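
The accumulation can be reproduced without any PDF at all. This is a minimal sketch that uses made-up page texts in place of extracted PDF pages; cumulative_counts is a hypothetical name, and the list of lists stands in for two files.

```python
def cumulative_counts(files, word):
    """Mimic the question's loop: output is created once and never reset."""
    output = []  # shared across all files, which is the bug
    counts = []
    for pages in files:
        output.extend(pages)
        # str(output) still contains the pages of every earlier file.
        counts.append(str(output).count(word))
    return counts


# Hypothetical page texts: file 1 has two "design", file 2 has one.
files = [["design process", "design"], ["design"]]
print(cumulative_counts(files, "design"))  # [2, 3]
```

The second value is 3 rather than 1 because it includes the two matches from the first file, which mirrors 195 = 112 + 83 in the question.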

Set output = [] at the top of the body of the main for-loop, not above it.
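
A sketch of the loop with that change applied might look like the following. It assumes PdfReader comes from the pypdf package (the question does not show its imports), and the function names count_words and count_words_in_pdfs are introduced here for illustration only. The page texts are collected in a list that is built fresh for each file, which has the same effect as moving output = [] inside the loop.

```python
import os


def count_words(page_texts, words):
    """Count substring occurrences of each word in one file's page texts."""
    text = "\n".join(page_texts)
    return {word: text.count(word) for word in words}


def count_words_in_pdfs(path, words):
    """Return {filename: counts} for every PDF file in the directory path."""
    # Assumption: the pypdf package is installed and provides PdfReader.
    from pypdf import PdfReader

    results = {}
    for name in sorted(os.listdir(path)):
        if not name.lower().endswith(".pdf"):
            continue
        with open(os.path.join(path, name), "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            # A new list per file, so nothing carries over from earlier files.
            page_texts = [page.extract_text() or "" for page in reader.pages]
        results[name] = count_words(page_texts, words)
    return results
```

Calling count_words_in_pdfs(path, ['design', 'process']) would then give one independent dictionary of counts per file instead of running totals.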

alexis