I am trying to write a program that analyzes SEC 10-K reports and converts them into readable text files. So far, I have been able to fetch the HTML from the sec-api API and save it locally (let me know if there is a better way to get the documents).
The problem is that the HTML, especially the tabular data, is hard to read:
<tr>
<td colspan="3" style="padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:8pt;font-weight:700;line-height:100%">Period</span>
</td>
<td colspan="3" style="padding:0 1pt"></td>
<td colspan="3" style="padding:2px 1pt;text-align:left;vertical-align:bottom">
<div style="text-align:center"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:8pt;font-weight:700;line-height:100%">Total
Number of Class C Shares Purchased </span></div>
<div style="text-align:center"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:8pt;font-weight:700;line-height:100%">(in
thousands)</span><span style="color:#000000;font-family:'Arial',sans-serif;font-size:5.2pt;font-weight:700;line-height:100%;position:relative;top:-2.8pt;vertical-align:baseline">(1)</span>
</div>
</td>
</tr>
<tr>
<td colspan="3" style="background-color:#cceeff;padding:2px 1pt;text-align:left;vertical-align:bottom">
<span style="color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:100%">October
1 - 31</span></td>
<td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td>
<td colspan="2" style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:right;vertical-align:bottom">
<span style="color:#000000;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:100%">8,585 </span>
</td>
</tr>
Ideally, I would like to convert each table row into a line that looks like this:
Number of Class C Shares Purchased (in thousands) from October 1 - 31: 8,585
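To make the target transformation concrete, here is a rough, dependency-free sketch of what I have in mind. It assumes the first `<tr>` is the header row, the first non-empty cell of each later row is the row label, and empty spacer cells can be dropped. The names `cellTexts` and `flattenTable` are my own, not from any library, and the regex-based tag stripping is only for illustration (it does not decode entities or handle nested tables), so a real HTML parser would likely be more robust.

```typescript
// Sketch only: flatten table rows into "<header> from <label>: <value>" lines.
// Assumption: header row first, then data rows; empty spacer cells are dropped,
// which would misalign columns if a real data cell were empty.

/** Return the visible text of every non-empty <td> in one <tr> fragment. */
function cellTexts(rowHtml: string): string[] {
  const cells = rowHtml.match(/<td\b[^>]*>[\s\S]*?<\/td>/gi) ?? [];
  return cells
    .map((cell) =>
      cell
        .replace(/<[^>]+>/g, " ") // replace every tag with a space
        .replace(/\s+/g, " ")     // collapse line breaks and repeated spaces
        .trim()
    )
    .filter((text) => text.length > 0);
}

/** Turn the rows of one table into readable lines (hypothetical helper). */
function flattenTable(tableHtml: string): string[] {
  const rows = tableHtml.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) ?? [];
  const [headerRow, ...dataRows] = rows.map(cellTexts);
  if (!headerRow) return [];

  // Drop trailing footnote markers such as "(1)" from header text.
  const headers = headerRow.map((h) => h.replace(/\s*\(\d+\)$/, ""));

  const lines: string[] = [];
  for (const [label, ...values] of dataRows) {
    values.forEach((value, i) => {
      // Column 0 of the header row describes the label column, so shift by one.
      const header = headers[i + 1] ?? `Column ${i + 2}`;
      lines.push(`${header} from ${label}: ${value}`);
    });
  }
  return lines;
}
```

Run on the two sample rows above, this should produce "Total Number of Class C Shares Purchased (in thousands) from October 1 - 31: 8,585", which keeps the leading "Total" that my hand-written example omits.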
I'm using TypeScript, so I would prefer a solution in that language.
I have already tried several different APIs, but I have had little success parsing their output. Almost all of the APIs offered by the SEC are of little help for parsing the HTML or for retrieving data by CIK number.