
I am looking for help web scraping the SEC's EDGAR database using BeautifulSoup. I have a list of investment firm names that I am iterating through, ultimately to access their 13F filings.

So far, using BeautifulSoup, I am able to select an entry, but I am having trouble combining the SEC's base URL with a specific file's path to actually access the data.

My code so far looks like:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0'}

for i in firms: # pre-determined list, but using IFP Advisors for this example as 'i'
    edgar_url = 'https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22' + i + '%22&first=2020&last=2021&output=atom'

    response = requests.get(url=edgar_url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    entries = soup.find_all('entry')

which gets me to a list of specific 13F filing entries.

   <entry>
      <title>13F-HR - IFP Advisors, Inc</title>
      <link rel="alternate" type="text/html" href="/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000001-index.htm"/>
      <summary type="html">&lt;b&gt;Filed Date:&lt;/b&gt; 01/25/2021 &lt;b&gt;Accession Number:&lt;/b&gt; 0001641866-21-000001 &lt;b&gt;Size:&lt;/b&gt; 4 MB</summary>
      <updated>01/25/2021</updated>
      <category scheme="http://www.sec.gov/" label="form type" term="4"/>
      <id>urn:tag:sec.gov,2008:accession-number=0001641866-21-000001</id>
   </entry>

Eventually, what I am looking to do is pull out the href shown above

/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000001-index.htm

and pair it with the scheme in the entry to access the 13F filing's text file, which can be found here: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

While I have the base URL designated, I am looking for a way to pull the link href from each entry and build a new URL to access more data.
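For the joining step itself, the standard-library `urllib.parse.urljoin` handles combining a base URL with an absolute path safely. A minimal sketch using the href from the example entry above:

```python
from urllib.parse import urljoin

base = "https://www.sec.gov"
href = "/Archives/edgar/data/1641866/000164186621000007/0001641866-21-000001-index.htm"

# urljoin resolves the path against the base, avoiding doubled or
# missing slashes that plain string concatenation can produce
full_url = urljoin(base, href)
```
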

Any help or suggestions would be appreciated. Thank you in advance!

therdawg

1 Answer


To get URLs for complete submissions you can use this example:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

firms = [
    "IFP Advisors, Inc",
]

entries = []
for i in firms:
    edgar_url = (
        r"https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%3D13F-HR+and+company-name+%3D+%22"
        + i
        + "%22&first=2020&last=2021&output=atom"
    )
    response = requests.get(url=edgar_url, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    entries.extend(soup.find_all("entry"))

for e in entries:
    url = "https://www.sec.gov" + e.link["href"]
    print("Getting URL:", url)
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    l = soup.select_one(
        'td:-soup-contains("Complete submission text file") + td a'
    )
    submission_url = "https://www.sec.gov" + l["href"]
    print("Complete submission text file:", submission_url)
    print()

Prints:

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000005/0001641866-21-000005.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000004/0001641866-21-000004.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000007/0001641866-20-000007.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000006/0001641866-20-000006.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000002/0001641866-20-000002.txt

Getting URL: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001-index.htm
Complete submission text file: https://www.sec.gov/Archives/edgar/data/1641866/000164186620000001/0001641866-20-000001.txt
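If your `bs4`/`soupsieve` version does not support the `:-soup-contains` pseudo-class, a fallback is to find the cell by its text and take the link from the neighbouring cell. A minimal sketch against a stub of the index page's table layout (the stub HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# stub of the relevant row on a filing index page (assumed layout)
html = """
<table>
  <tr>
    <td>Complete submission text file</td>
    <td><a href="/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001.txt">0001641866-21-000001.txt</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# locate the label cell by its exact text, then the link in the next cell
td = soup.find("td", string="Complete submission text file")
link = td.find_next_sibling("td").a
submission_url = "https://www.sec.gov" + link["href"]
```
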

Andrej Kesely
  • Thanks for the response, Andrej. When I tried to execute what you posted above, I received an error message: NotImplementedError: ':-soup-contains' pseudo-class is not implemented at this time. Any ideas on what could be going wrong? It appears that the "Getting URL" segment worked on my end, but the "Complete submission text file" URL did not. I was still able to reach my ultimately desired result by using replace on the "Getting URL" segment to swap "-index.htm" for ".txt" – therdawg Sep 15 '21 at 14:21
  • @user14222854 Consider upgrading your version of `bs4` to the latest. I'm on `4.10.0` – Andrej Kesely Sep 15 '21 at 15:21
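The workaround mentioned in the comment above can be sketched as follows, since EDGAR's index URL and the complete-submission .txt URL differ only in their suffix:

```python
# derive the complete-submission URL directly from the index URL
# by swapping the suffix, avoiding the CSS selector entirely
index_url = "https://www.sec.gov/Archives/edgar/data/1641866/000164186621000001/0001641866-21-000001-index.htm"
submission_url = index_url.replace("-index.htm", ".txt")
```
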