
I am trying to extract a certain section from HTML files. Specifically, I am looking for the "ITEM 1" section of 10-K filings (the annual business report of a US company), e.g.: https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002

Problem: I am not able to find the "ITEM 1" section, and I have no idea how to tell my algorithm to search from one point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.

I am super thankful for any help.

Among other things, I have tried this (and similar attempts), but my `bd` is always empty:

    import re
    import requests
    from bs4 import BeautifulSoup

    html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm').text
    soup = BeautifulSoup(html, 'html.parser')

    try:
        # matches nothing: the heading text is not exactly "ITEM 1"
        bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
        # matches nothing: "name" filters on tag names, not on text
        # bd = soup.find_all(name="ITEM 1")
        # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])

        print(" Business Section (Item 1): ", bd)

    except:
        print("\n Section not found!")
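A minimal illustration of the "extract between two markers" idea: flatten the markup to plain text, then slice between the two headings with a regex. The HTML snippet and the crude tag-stripping regex here are toy assumptions for demonstration; real filings are far messier and `BeautifulSoup.get_text()` is a more robust way to flatten them.

```python
import re

# A tiny stand-in for a filing; real EDGAR HTML is far messier.
html = "<p><b>ITEM 1. BUSINESS</b></p><p>We make widgets.</p><p><b>ITEM 1A. RISK FACTORS</b></p>"

# Crude flattening: strip tags, then collapse whitespace runs.
text = re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()

# Non-greedy slice between the two headings, case-insensitively.
m = re.search(r"ITEM\s+1\.(.*?)ITEM\s+1A\.", text, re.I)
item1 = m.group(1).strip() if m else None
print(item1)  # BUSINESS We make widgets.
```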

Using Python 3.7 and BeautifulSoup 4.

Regards Heka

  • I believe it's easier to do it with xpath, which means not using beautifulsoup, but lxml. If you're interested, I can post an answer. – Jack Fleeting Dec 26 '19 at 02:08
  • Thanks for the answer. Would be nice if you could give me a hint for your lxml solution. I also tried it with that before, but couldn't manage. – Heka Jan 02 '20 at 10:06
  • I'm not sure what kind of hint you need. I can post an answer, as I suggested and you can test it. The answer worked on that particular filing, but the fundamental problem with all EDGAR filings is that they are not required to use uniform formatting, so each filer/edgarization provider formats them differently, which means many solutions work sometimes and sometimes they don't. It's just a fact of life with EDGAR... – Jack Fleeting Jan 02 '20 at 12:07
  • Ah, now I get it! Thanks. I'd be happy to try your solution! – Heka Jan 03 '20 at 13:18
  • See answer below. – Jack Fleeting Jan 03 '20 at 17:12
  • It's so strange. What's your version? pip show simplified_scrapy – dabingsou Jan 07 '20 at 07:46

2 Answers


There are special whitespace characters inside the heading, so a plain text match fails. Normalize them first:

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
# collapse the whitespace run after "ITEM" into a single space
doc.loadHtml(doc.replaceReg(doc.html, r'ITEM\s+', 'ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1) # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}

# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
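In plain Python, the normalization the `replaceReg` call performs amounts to collapsing the whitespace run after "ITEM". A small sketch (the sample string is an assumption about what the filing contains; note that in Python 3, `\s` already matches the non-breaking space `\xa0`):

```python
import re

# A heading split by a non-breaking space and a newline, as often seen in filings.
s = "ITEM\xa0\n    1. BUSINESS"

# \s matches \xa0 in Python 3, so one substitution collapses the whole run.
normalized = re.sub(r"ITEM\s+", "ITEM ", s)
print(normalized)  # ITEM 1. BUSINESS
```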

If you use the latest version, you can use the following method instead:

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg(r'ITEM\s+1') # pass in a regex directly
print(item1, item1.text) # {'tag': 'B', 'html': 'ITEM\n    1. BUSINESS'} ITEM 1. BUSINESS

# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
dabingsou
  • Hey, thanks for your reply. I tried it out, but `item1` is always `None`. I think the `doc.getElementByText('ITEM 1')` can't find the text, even if I replace it with `doc.loadHtml(doc.replaceReg(doc.html, 'ITEM[^\S]+1','ITEM '))`, if I correctly understand the code. – Heka Jan 02 '20 at 10:14
  • Thanks, I checked again, but I still get `None` for the `item1`. – Heka Jan 03 '20 at 13:27
  • I have Version: 0.8.91. I also did `pip install --upgrade simplified_scrapy`, but it's already up to date! – Heka Jan 08 '20 at 10:17
  • Sorry, I can't help you. I have no problem here. I don't know what's wrong. – dabingsou Jan 10 '20 at 02:54
  • No worries, your code helped and clarified anyway, meaning I learned something, still. Thanks! – Heka Jan 10 '20 at 10:45

As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally work (after some adjustments...)

import requests
import lxml.html

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)

# In this filing, Item 1 is hiding in a series of <p> tags following a table
# containing an <a> tag whose "name" attribute has the value "a_002".
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')

flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # extract the text from each <p> tag and move to the next
        print(i.text_content().strip().replace('\n', ''))
    nxt = i.getparent().getnext()
    # The following detects when the <p> tags of Item 1 end and the next Item begins, then stops.
    if nxt is not None and nxt.tag == 'table':
        for j in nxt.iterdescendants():
            if j.tag == 'a' and j.get('name') == 'a_003':
                # We have encountered the <a> tag whose "name" attribute is "a_003",
                # indicating the beginning of the next Item, so we stop.
                flag = 'stop'

The output is the text of Item 1 in this filing.
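The stop-flag walk above can be sketched in isolation: collect nodes that appear after the start anchor and stop at the next section anchor. This pure-Python stand-in (the node list and anchor values are toy assumptions, not tied to lxml) shows the pattern:

```python
# Pure-Python stand-in for the sibling walk: emit nodes between two anchors.
nodes = ["a_002", "Item 1 text A", "Item 1 text B", "a_003", "Item 1A text"]

collected, inside = [], False
for node in nodes:
    if node == "a_002":   # start anchor (marks "ITEM 1")
        inside = True
        continue
    if node == "a_003":   # next anchor (marks "ITEM 1A") ends the section
        break
    if inside:
        collected.append(node)

print(collected)  # ['Item 1 text A', 'Item 1 text B']
```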

Jack Fleeting