How to count occurrences of words from a list in text extracted from a PDF using Python?

Question

I am trying to count a series of words extracted from a PDF, but I only get 0 for every word, which is not correct.

total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables=[]

words = ['blank','warrant ','offering','combination ','SPAC','founders']
count={} # is a dictionary data structure in Python


with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i,pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        for elem in words:
            count[elem] = 0
        for line in f'{i} --- {tbl}' :
            elements = line.split()
            for word in words:
                count[word] = count[word]+elements.count(word)
print (count)

SilentCloud · Accepted Answer · 2021-10-08T09:53:17.057


This will do the job:

import pdfplumber
pdf_file = "CapitalCorp.pdf"
words = ['blank','warrant ','offering','combination ','SPAC','founders']

# Get text
text = ''
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without text in some pdfplumber versions
        text += '\n' + (page.extract_text() or '')

# Set up the count dictionary with every keyword at 0
count = {}
for elem in words:
    count[elem] = 0

# Count occurrences of each keyword (case-sensitive substring match)
for el in words:
    count[el] = text.count(el)

First, you store the content of the PDF in the variable text, which is a string.

Then, you set up the count dictionary, with one key for every element of words and each value initialised to 0.

Last, you count the occurrences of every element of words in text with the count() method and store the result under the respective key of your count dictionary.
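To make the behaviour of count() concrete, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the PDF). It shows that str.count() matches substrings, is case-sensitive, and treats a trailing space in the keyword as part of the pattern:

```python
# Made-up sample standing in for the text extracted from the PDF.
sample = (
    "Each Warrant entitles the holder to one share. The warrants expire.\n"
    "A warrant\nmay be redeemed."
)

# Substring match: finds "warrants" and "warrant\n", but not "Warrant".
substring_hits = sample.count("warrant")

# A trailing space in the keyword must be matched literally,
# so "warrants" and "warrant\n" no longer count.
trailing_space_hits = sample.count("warrant ")

# Lower-casing the text first makes the match case-insensitive.
case_insensitive_hits = sample.lower().count("warrant")

print(substring_hits, trailing_space_hits, case_insensitive_hits)
```

So a keyword such as 'warrant ' can miss occurrences followed by a line break, punctuation, or a plural "s".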

edited Oct 08 '21 at 09:53

answered Oct 08 '21 at 09:05


Thank you :) This code is working, but I am missing some words. For example, I am supposed to find "warrant" 281 times in the PDF file, but with your code I find it only 65 times. – Math4264 Oct 08 '21 at 09:32
Can you share the PDF? I made an edit removing the lower() method: try that now. – SilentCloud Oct 08 '21 at 09:53
I downloaded the PDF from here: https://sec.report/Document/0001144204-14-043307/v383897_424b4.htm – Math4264 Oct 08 '21 at 10:14
When I remove lower() I miss a few more arguments. – Math4264 Oct 08 '21 at 10:15
I will look into that later, but be careful with whitespace and capital letters; the problem is probably some detail about that. – SilentCloud Oct 08 '21 at 10:27

Yes the problem was due to blank space! Thank you very much for your help! – Math4264 Oct 08 '21 at 12:10
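For reference, one way to guard against the stray blank spaces (and, optionally, capital letters) is sketched below. The helper name count_keywords, its ignore_case parameter, and the sample text are all hypothetical; the text would normally come from pdfplumber as in the answer:

```python
import re

def count_keywords(text, keywords, ignore_case=True):
    """Count substring occurrences of each keyword, ignoring stray whitespace around it."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for raw in keywords:
        keyword = raw.strip()  # 'warrant ' -> 'warrant'
        # re.escape keeps any special characters in the keyword literal.
        counts[keyword] = len(re.findall(re.escape(keyword), text, flags))
    return counts

# Made-up sample text, not the real PDF content.
sample_text = (
    "Blank check company. Each Warrant is exercisable.\n"
    "The warrants expire after the business combination."
)
result = count_keywords(sample_text, ['blank', 'warrant ', 'combination ', 'SPAC'])
print(result)
```

Note that this is still a substring match, so "warrant" also counts "warrants", and with ignore_case=True a keyword like "SPAC" would also match inside a word such as "space". Pass ignore_case=False if that matters.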
