REGEX extract information from EDGAR SC-13 form

Question

I am trying to extract information from the latest SEC EDGAR Schedule 13 filings.

Link of the filing as an example:

1) Saba Capital_27-Dec-2019_SC13

The information I am trying to extract (and the parts of the filing with the information)

1) Names of reporting persons: Saba Capital Management, L.P.

<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>

2) Name of issuer : WESTERN ASSET HIGH INCOME FUND II INC

<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>

3) CUSIP Number: 95766J102 (managed to get)

<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>

4) Percentage of class represented by amount : 11.3% (managed to get)

<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>

5) Date of Event Which requires filing of this statement: December 24, 2019

<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>


import requests 
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')

## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]', soup.text)

### get % 
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+\.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])

How can I extract the 5 pieces of information from the filing? Thanks in advance
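For the two regex-based lookups, a minimal sketch that ties each pattern to its label may be less fragile than scanning the whole text for anything shaped like a CUSIP or a number. The `sample_text` string and the `first_group` helper below are hypothetical stand-ins, and the patterns assume the label wording shown in the excerpts above; other filers may word or order things differently.

```python
import re

# Hypothetical sample standing in for soup.text; a real filing is much longer.
sample_text = """
WESTERN ASSET HIGH INCOME FUND II INC.
(Name of Issuer)
95766J102
(CUSIP Number)
PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)
11.3%
"""

# CUSIP-shaped token immediately followed by its label (assumed layout).
CUSIP_RE = re.compile(
    r"([0-9]{3}[A-Z0-9]{2}[A-Z0-9*@#]{3}[0-9])\s*\(CUSIP Number\)",
    re.IGNORECASE,
)

# First number that is directly followed by '%' after the label.
# The lazy '.*?' skips over the "(11)" in the label itself.
PERCENT_RE = re.compile(
    r"PERCENT OF CLASS.*?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE | re.DOTALL,
)


def first_group(pattern, text):
    """Return the first captured group, or None when the pattern is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


cusip = first_group(CUSIP_RE, sample_text)
percent = first_group(PERCENT_RE, sample_text)
print(cusip, percent)
```

Returning `None` instead of raising keeps a batch run going when a filing does not follow the assumed layout.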

CUSIP = `soup.select_one("body > document > type > sequence > filename > description > text > p:nth-child(9) > b > u").get_text()` — Manali Kagathara, Dec 30 '19 at 12:02
Kindly confirm whether all items are in the same positions, in case you apply the code to multiple pages. — αԋɱҽԃ αмєяιcαη, Dec 30 '19 at 12:34
I wouldn't use regex for html under any circumstances (search around to see why). The easiest way to do it is with xpath, which would require using lxml instead of beautifulsoup. If you're interested, I can post an answer. — Jack Fleeting, Dec 30 '19 at 12:38
@αԋɱҽԃαмєяιcαη I don't think all the items are in the same position. Yes, I want to apply it to other pages. — Lko, Dec 30 '19 at 12:52

KunduK · Accepted Answer · 2019-12-30T13:32:10.473

Try the following code to get all the values. It uses find() and the CSS selector method select_one().

import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)

Output:

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

UPDATED

If you don't want to rely on element positions, try the version below, which locates each value by its label text.

import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)

Output:

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

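Update 2:

If your BeautifulSoup version only implements the nth-of-type pseudo-class (the NotImplementedError reported in the comments), use nth-of-type in place of nth-child.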

import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)
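Since the first version uses nth-child and Update 2 uses nth-of-type, a small sketch of how the two selectors differ may help. The markup here is hypothetical and only illustrates the counting rule; it assumes a BeautifulSoup version whose select_one() supports both pseudo-classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading followed by two paragraphs under one parent.
html = "<div><h1>Cover page</h1><p>first paragraph</p><p>second paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

# nth-child(2): a <p> that is the 2nd child of its parent, counting every tag.
by_child = soup.select_one("p:nth-child(2)").text

# nth-of-type(2): the 2nd <p> among its siblings, ignoring other tag names.
by_type = soup.select_one("p:nth-of-type(2)").text

print(by_child)
print(by_type)
```

The two indices only coincide when every preceding sibling is also a <p>, so the same number can point at different elements depending on which pseudo-class is used.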

I can't get soup.select_one to work for the name of issuer, CUSIP and date. If the positions of the items change, these three may not work, right? — Lko, Dec 30 '19 at 12:53
NotImplementedError Traceback (most recent call last) in () ----> 9 NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip() /usr/local/lib/python3.6/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit) 1526 else: 1527 raise NotImplementedError( -> 1528 'Only the following pseudo-classes are implemented: nth-of NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type. — Lko, Dec 30 '19 at 13:15
@Lko: I would suggest using the updated code, where you don't need to rely on position. Have you tried the answer under the UPDATED section? — KunduK, Dec 30 '19 at 13:27
I don't know why you are getting that error. Can you try the Update 2 option? — KunduK, Dec 30 '19 at 13:32
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/205082/discussion-between-lko-and-kunduk). — Lko, Dec 30 '19 at 13:40
@KunduK is there a way to make the "Issuer" in soup.select_one('p:contains(Issuer)') case-insensitive, so that it matches any of ['Issuer', 'ISSUER', 'issuer']? — Lko, Jan 01 '20 at 06:06
@KunduK is there a way to search for multiple tags at the same time, because the document formatting may be different? — Lko, Jan 01 '20 at 06:07
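On the last two comments (case-insensitive labels and several tag names at once), one possible approach is to pass a list of tag names to find_all() and test each tag's text against a regex compiled with re.IGNORECASE. This is only a sketch: `value_before_label` is a hypothetical helper, the markup is made up to resemble the cover page, and it assumes the value sits in a <u> inside the same tag as its label, as in the filing above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the cover page; real filings vary.
html = """
<p><b><u>WESTERN ASSET HIGH INCOME FUND II INC.</u>(NAME OF ISSUER)</b></p>
<div><b><u>95766J102</u>(Cusip Number)</b></div>
"""
soup = BeautifulSoup(html, "html.parser")


def value_before_label(soup, label, tags=("p", "div", "td")):
    """Return the underlined text of the first tag whose text contains label.

    The label is matched case-insensitively; tags lists the tag names to try.
    """
    pattern = re.compile(re.escape(label), re.IGNORECASE)
    for tag in soup.find_all(list(tags)):
        if pattern.search(tag.get_text()):
            underlined = tag.find("u")
            if underlined is not None:
                return underlined.get_text(strip=True)
    return None


issuer = value_before_label(soup, "name of issuer")
cusip = value_before_label(soup, "CUSIP NUMBER")
print(issuer)
print(cusip)
```

One limitation of this sketch: when candidate tags are nested (for example a <p> inside a <td>), the outer tag matches first, so the first <u> it contains may not be the one next to the label.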

score 0 · Answer 2 · answered Dec 30 '19 at 13:15

Using lxml, it should work this way:

import requests
import lxml.html

url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)

doc = lxml.html.fromstring(source.text)
# Each XPath anchors on a label and takes the first matching text node.
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]
print(*(value.strip() for value in (name, issuer, cusip, perc, event)), sep='\n')

Output:

Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019

answered Dec 30 '19 at 13:15

Jack Fleeting

This works on this page. Can you explain how I can apply it to other pages (from the same type of filing)? Example: https://www.sec.gov/Archives/edgar/data/1061983/000089914019000648/p32629470a.htm – Lko Dec 30 '19 at 13:26
@Lko - Unfortunately, that's exactly one of the main problems with parsing EDGAR filings. The SEC doesn't require a uniform edgarization method, so each filer edgarizes its filings (or the edgarization service provider does) differently. In many cases, filings from the same filer use the same edgarization format, but you still have to do some manual inspection of the filing to figure out where things hide... – Jack Fleeting Dec 30 '19 at 13:31
I see. Can you explain '//u/text()' and '/following-sibling::p/text()', or point me to any resources where I can find out more? – Lko Dec 30 '19 at 13:50
@Lko - There's quite a lot of info about xpath, though it may take a while to grasp. Good place to start is: https://www.w3schools.com/xml/xpath_intro.asp – Jack Fleeting Dec 30 '19 at 13:54
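As a rough illustration of the two XPath pieces asked about, here is a small sketch on a hypothetical fragment modelled on the filing's markup. Note one deliberate difference from the answer: the second expression uses contains(., ...), which tests the element's full text including descendants, rather than contains(text(), ...).

```python
import lxml.html

# Hypothetical fragment; the real filing has far more markup around these tags.
fragment = (
    "<div>"
    "<p>NAME OF REPORTING PERSON</p>"
    "<p>Saba Capital Management, L.P.</p>"
    "<p><b><u>95766J102</u>(CUSIP Number)</b></p>"
    "</div>"
)
doc = lxml.html.fromstring(fragment)

# following-sibling::p selects the <p> elements that come after the matched
# <p> under the same parent; /text() returns their direct text nodes.
name = doc.xpath(
    '//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()'
)[0]

# //u selects every <u> descendant of the matched <p>, at any depth;
# /text() returns the text inside those <u> elements.
cusip = doc.xpath('//p[contains(., "(CUSIP Number)")]//u/text()')[0]

print(name)
print(cusip)
```

In both cases xpath() returns a list, so the [0] picks the first hit and raises IndexError when nothing matches, which is worth guarding against when running over many filings.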
