I am trying to parse the text section of SEC EDGAR filings in Python 3, e.g.: https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt
My goal is to count how often certain keywords occur in the visible text body of the 10-K statements and save the counts to a dictionary (i.e., I am not interested in any tables, exhibits, etc.).
I am very new to Python and would appreciate any help!
This is what I have written so far, but the code doesn't return the right number of occurrences, and it doesn't capture the main text body visible to the end user.
import requests
from bs4 import BeautifulSoup

# this part I would like to change such that it only collects words visible to the normal user in the page (is that the body?)
def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    print('*'*20)
    return len(words)

def main():
    url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
    word_list = ['assets']
    for word in word_list:
        count = count_words(url, word)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))
        print('--'*20)

# this part I don't understand
if __name__ == '__main__':
    main()
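
For reference, two things in the code above explain the wrong count: soup.find returns only the first matching string, and len(words) is then the number of characters in that string, not the number of matches. Below is a minimal sketch of one way the counting could work, not a definitive solution. It assumes the .txt file wraps each document in <DOCUMENT>...</DOCUMENT> with a <TYPE> line and a <TEXT> block, and it treats everything outside <table>, <script>, <style> and <head> as "visible". It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages. The names VisibleTextParser, extract_10k_body and count_keywords are made up for this illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything nested inside the SKIP tags."""

    # Assumption: these tags hold content that is not part of the readable body.
    SKIP = {'table', 'script', 'style', 'head'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside at least one skipped tag
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_10k_body(raw_filing):
    """Return the <TEXT> block of the first document typed 10-K.

    Falls back to the whole input if no such document is found.
    """
    docs = re.findall(r'<DOCUMENT>(.*?)</DOCUMENT>', raw_filing, flags=re.S | re.I)
    for doc in docs:
        doc_type = re.search(r'<TYPE>([^\s<]+)', doc, flags=re.I)
        if doc_type and doc_type.group(1).upper() == '10-K':
            body = re.search(r'<TEXT>(.*)</TEXT>', doc, flags=re.S | re.I)
            return body.group(1) if body else doc
    return raw_filing


def count_keywords(raw_filing, keywords):
    """Map each keyword to its number of whole-word, case-insensitive matches."""
    parser = VisibleTextParser()
    parser.feed(extract_10k_body(raw_filing))
    parser.close()
    text = ' '.join(parser.chunks).lower()
    word_counts = Counter(re.findall(r"[a-z']+", text))
    return {kw: word_counts[kw.lower()] for kw in keywords}


# The guard below runs the demo only when this file is executed directly
# (python thisfile.py), not when it is imported from another module.
if __name__ == '__main__':
    sample = (
        "<DOCUMENT>\n<TYPE>10-K\n<TEXT>\n"
        "<html><head><title>assets</title></head><body>"
        "<p>Total assets increased. Our Assets are diverse.</p>"
        "<table><tr><td>assets</td></tr></table>"
        "</body></html>\n</TEXT>\n</DOCUMENT>\n"
        "<DOCUMENT>\n<TYPE>EX-21\n<TEXT>\n<p>assets</p>\n</TEXT>\n</DOCUMENT>"
    )
    print(count_keywords(sample, ['assets', 'liabilities']))
    # prints {'assets': 2, 'liabilities': 0}
```

To try it on a real filing, the r.text from the requests.get call in the question could be passed as raw_filing. Note that this counts whole words, so 'asset' and 'assets' are separate keys, whereas the_word in text in the question matches substrings. Some filings lay out ordinary paragraphs inside tables, so dropping 'table' from SKIP may be necessary, and an unclosed skipped tag would hide everything after it.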