I am trying to extract a specific section from HTML files: the "ITEM 1" section of 10-K filings (annual business reports of US companies). For example: https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: I cannot locate the "ITEM 1" heading, and I do not know how to tell my code to search from that point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.
I am super thankful for any help.
Among other things, I have tried the following (and similar variants), but my bd is always empty:
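    import re

    # soup is the BeautifulSoup object for the filing's HTML
    try:
        bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
        # Other attempts, tried one at a time in place of the line above:
        # bd = soup.find_all(name="ITEM 1")
        # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
        print(" Business Section (Item 1): ", bd)
    except Exception:
        print("\n Section not found!")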
I am using Python 3.7 and BeautifulSoup 4.
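To make the goal concrete, here is a rough sketch of the "from ITEM 1 to ITEM 1A" extraction I have in mind, working on plain text rather than on the tag tree. It rests on several assumptions: the helper names `html_to_text` and `extract_section` are made up for illustration, the sample text is invented, and I am guessing that the headings sit at the start of a line once the HTML is flattened. Since "Item 1" will probably also show up in the table of contents, the sketch keeps the longest candidate span. I have not checked whether this holds up on real filings.

```python
import re

# Headings as they might appear in the flattened text (an assumption).
# In Python 3, \s also matches non-breaking spaces. The lookahead keeps
# "ITEM 1" from matching "ITEM 1A", "ITEM 10", and so on.
ITEM_1 = re.compile(r"^\s*item\s+1(?![0-9a-z])[.:]?", re.IGNORECASE | re.MULTILINE)
ITEM_1A = re.compile(r"^\s*item\s+1a(?![0-9a-z])[.:]?", re.IGNORECASE | re.MULTILINE)


def html_to_text(html):
    """Flatten HTML to plain text, one text node per line (hypothetical helper)."""
    from bs4 import BeautifulSoup  # imported here so the rest runs without bs4
    return BeautifulSoup(html, "html.parser").get_text("\n")


def extract_section(text, start_re=ITEM_1, end_re=ITEM_1A):
    """Return the longest text span between a start heading and the next end heading."""
    best = ""
    for start in start_re.finditer(text):
        end = end_re.search(text, start.end())
        if end is None:
            continue  # no closing heading after this start heading
        candidate = text[start.end():end.start()].strip()
        if len(candidate) > len(best):
            best = candidate  # the short table-of-contents span loses to the body
    return best


# Invented sample: two table-of-contents lines followed by the actual sections.
sample = (
    "Item 1. Business 3\n"
    "Item 1A. Risk Factors 7\n"
    "ITEM 1. BUSINESS\n"
    "We make widgets.\n"
    "We sell them worldwide.\n"
    "ITEM 1A. RISK FACTORS\n"
    "Demand may fall.\n"
)
section = extract_section(sample)
print(section)
```

For a real filing the idea would be to call `extract_section(html_to_text(html))`, where `html` is the downloaded page source.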
Regards Heka