
I am trying to parse the text section of SEC EDGAR filings in Python 3, e.g.: https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt

My goal is to count how often certain keywords occur in the visible text body of 10-K statements and save the counts to a dictionary (i.e., I am not interested in tables, exhibits, etc.).

I am very new to Python and would appreciate any help!

This is what I have written so far, but the code doesn't return the right number of occurrences, and it doesn't capture the main text body visible to the end user.

import requests
from bs4 import BeautifulSoup

# I would like to change this part so that it only collects words visible to a normal user of the page (is that the body?)

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    print('*'*20)
    return len(words)


def main():
    url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
    word_list = ['assets']
    for word in word_list:
        count = count_words(url, word)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))
        print('--'*20)

# I don't understand this part
if __name__ == '__main__':
    main()
dernuco
  • For the example case, what is the correct number of occurrences? What occurrences does your code find that it shouldn't? Or not find that it should? – Michael Dyck Apr 11 '20 at 17:22
  • Hi Michael, thank you for looking into this! I tried the word "digital": if you open the link and use the browser's search function, it reports 382 matches. However, when I open the end-user-friendly view (here: https://www.sec.gov/Archives/edgar/data/796343/000079634314000004/adbe10kfy13.htm), I only get 176 matches, which is the number I am interested in. Also, there are some parts of the .txt link that I am not sure are rendered properly. They look weird: the code only covers approx. a third of the page, and the rest consists only of special characters – dernuco Apr 12 '20 at 11:25
  • If, in your sample code, you change 'assets' to 'digital', the output says "contains 493 occurrences of word: digital", but 493 is neither 382 nor 176. So your code isn't counting occurrences. Instead, because you're using 'soup.find', it only finds the first element-text that satisfies your lambda, and the variable 'words' holds that text as a string, and len(words) just returns the length of that string. (I.e., it contains 493 characters.) You need to use `soup.find_all`. (However, this doesn't answer the question you actually asked, so I'm leaving it as a comment.) – Michael Dyck Apr 12 '20 at 15:41
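One possible way to get closer to the count seen in the .htm view is sketched below. It rests on assumptions not confirmed in the thread: that the .txt file wraps each document of the filing in a `<DOCUMENT>` block with a `<TYPE>` line and a `<TEXT>` body, that the main report is the first block whose type is `10-K`, and that "visible text" means text outside `<table>`, `<script>`, `<style>` and `<title>` elements. The "special characters" sections are probably encoded binary attachments in other `<DOCUMENT>` blocks, which this approach skips. The names `main_document_text`, `count_keywords` and `VisibleTextParser` are hypothetical helpers, and the standard-library parser is used in place of BeautifulSoup so that the sketch runs without extra packages.

```python
import re
from html.parser import HTMLParser

# Assumed layout of an EDGAR full-submission file: repeated
# <DOCUMENT> ... <TYPE>10-K ... <TEXT> ... </TEXT> ... </DOCUMENT> blocks.
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring everything nested inside SKIP tags."""

    SKIP = {"table", "script", "style", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of skipped elements we are currently inside
        self.chunks = []  # text fragments found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def main_document_text(submission, form_type="10-K"):
    """Return the whitespace-normalized text of the first matching document."""
    for block in DOCUMENT_RE.findall(submission):
        type_match = TYPE_RE.search(block)
        text_match = TEXT_RE.search(block)
        if type_match and text_match and type_match.group(1).upper() == form_type:
            parser = VisibleTextParser()
            parser.feed(text_match.group(1))
            parser.close()
            # Join fragments with a space, then collapse runs of whitespace.
            return " ".join(" ".join(parser.chunks).split())
    return ""


def count_keywords(text, keywords):
    """Count case-insensitive whole-word matches of each keyword."""
    counts = {}
    for word in keywords:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        counts[word] = len(pattern.findall(text))
    return counts


# Tiny made-up submission, used instead of a real download.
sample = """<SEC-DOCUMENT>
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<TEXT>
<html><body>
<p>Digital media and digital marketing.</p>
<table><tr><td>Digital assets</td></tr></table>
</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-21
<TEXT>
digital subsidiaries
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>"""

text = main_document_text(sample)
print(count_keywords(text, ["digital", "assets"]))  # {'digital': 2, 'assets': 0}
```

For a real filing, the string passed to `main_document_text` would be the downloaded file contents, for example `requests.get(url).text`. The counts may still differ from a browser search: the pattern matches whole words only (dropping the `\b` anchors gives substring matching closer to a browser's find), and a word split across inline tags would be missed because fragments are joined with a space.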
