I need help finding a list of specific words in many PDF files using Python. For example, I want to count the occurrences of the words "design" and "process" in two PDF files.
The following is my code:
import os
from pypdf import PdfReader  # assumption: pypdf; PyPDF2 also provides PdfReader

output = []
count = 0
for fp in os.listdir(path):
    pdfFileObj = open(os.path.join(path, fp), 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    for i in range(number_of_pages):
        page = reader.pages[i]
        output.append(page.extract_text())
    text = str(output)
    words = ['design','process']
    count = {}
    for elem in words:
        count[elem] = 0
    # Count occurrences
    for i, el in enumerate(words):
        count[f'{words[i]}'] = text.count(el)
    print(count)
The code output is: {'design': 112, 'process': 31} {'design': 195, 'process': 56}
The first count is right: the first PDF file does contain 112 occurrences of "design" and 31 of "process". The second count, however, is wrong. The second PDF contains 83 occurrences of "design" and 25 of "process", but the output values are much larger.
My expected output is: {'design': 112, 'process': 31} {'design': 83, 'process': 25}
I noticed that subtracting the first count from the second gives the correct values (195 - 112 = 83, 56 - 31 = 25). I don't know how to fix the code. Could someone please help me? Thank you so much.
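Since the difference between the two outputs matches the expected counts, the likely cause is that `output` is created once, before the loop over files, so the second file is counted together with the pages of the first. Below is a minimal sketch of one possible fix, not a definitive implementation. It assumes the `pypdf` package (PyPDF2 exposes the same `PdfReader` name), and the helper names `count_words` and `count_words_in_pdfs` are hypothetical. It builds a fresh list of page texts for every file, so each count covers one document only.

```python
import os


def count_words(page_texts, words):
    """Count substring occurrences of each word across one document's pages."""
    text = "\n".join(page_texts)
    # str.count is case-sensitive and matches substrings ("designed" counts as "design"),
    # which is the same behavior as the original code.
    return {word: text.count(word) for word in words}


def count_words_in_pdfs(path, words):
    """Yield (filename, counts) for each PDF in `path`, one document at a time."""
    # Imported here so count_words can be used without the PDF library.
    # Assumption: pypdf is installed; with PyPDF2, import PdfReader from there instead.
    from pypdf import PdfReader

    for fp in sorted(os.listdir(path)):
        if not fp.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        with open(os.path.join(path, fp), "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            # A fresh list per file, so earlier files are not counted again.
            page_texts = [page.extract_text() or "" for page in reader.pages]
        yield fp, count_words(page_texts, words)
```

With the `path` variable from the code above, it could be used as `for name, counts in count_words_in_pdfs(path, ['design', 'process']): print(name, counts)`. The smallest change to the original code would be to move `output = []` inside the `for fp in os.listdir(path):` loop, which has the same effect.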