I would like to parse old-style EDGAR .txt files from the SEC. They contain various filings with free financial data, but extracting that data is far from trivial, because each table is just plain text laid out to look like a table.
Here is the link to the example file
I have made a start on a program, but it is very flaky and needs a lot of tuning for different situations. It relies on hard-coded column positions, so a similar file from 2000 instead of 1999 would break it as soon as the column widths change. I'm not a programmer, and I wonder whether there is a more robust and scalable way to parse this type of text file. Thanks.
from bs4 import BeautifulSoup
import requests

# Download the raw filing and parse it as loosely structured markup.
fo_99 = requests.get("https://www.sec.gov/Archives/edgar/data/1067983/000095015099001240/0000950150-99-001240.txt")
soup_99 = BeautifulSoup(fo_99.text, "lxml")

# Locate the tables through their <CAPTION> tags.
tables_99 = soup_99.find_all("caption")
print(len(tables_99))

# Take the <S> elements under the second caption.
table = tables_99[1].find_all("s")
print(len(table))

for line in str(table[0]).split("\n"):
    if len(line) > 11:
        if not line.startswith("<s>"):
            # Hard-coded column boundaries, tuned by hand for this 1999 filing.
            print(line[0:25], "|",
                  line[25:30], "|",
                  line[30:43], "|",
                  line[43:55], "|",
                  line[55:66], "|",
                  line[66:72], "|",
                  line[72:76], "|",
                  line[76:87], "|",
                  line[87:109], "|",
                  line[109:121], "|",
                  line[121:128], "|",
                  line[128:], "|")
    else:
        # Short lines are printed unchanged.
        print(line)
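One direction I have been considering, instead of hard-coding the slice positions, is to infer the column boundaries from the data rows themselves. The sketch below is only an illustration under some assumptions: all data rows of a table share the same layout, columns are separated by at least two blank characters in every row, and no value spills into a gutter. The function names `find_column_spans` and `split_row` and the sample rows are made up for this example and do not come from the filing.

```python
def find_column_spans(rows, min_gap=2):
    """Return half-open (start, end) spans of text columns.

    A column boundary is assumed wherever at least `min_gap`
    consecutive character positions are blank in every row.
    """
    rows = [r.rstrip() for r in rows]
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    # occupied[i] is True if any row has a non-space character at position i.
    occupied = [any(r[i] != " " for r in padded) for i in range(width)]

    spans = []
    start = None  # start of the column currently being scanned
    gap = 0       # length of the current run of blank positions
    for i, used in enumerate(occupied):
        if used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The column ended just before this run of blanks began.
                spans.append((start, i - gap + 1))
                start = None
    if start is not None:
        spans.append((start, width - gap))
    return spans


def split_row(row, spans):
    """Cut one row at the inferred spans and trim each cell."""
    return [row[start:end].strip() for start, end in spans]


# Made-up sample rows in the style of a plain-text financial table.
rows = [
    "Total revenues         1,234     5,678",
    "Net income               321      (45)",
    "Cash                  10,000    12,500",
]

spans = find_column_spans(rows)
for row in rows:
    print(split_row(row, spans))
```

Because the spans are recomputed for each table, a filing with different column widths would not need new hand-tuned numbers, although header lines that span several columns would have to be excluded from `rows` first. The resulting list of half-open `(start, end)` tuples has the same shape as the `colspecs` argument of `pandas.read_fwf`, which may be an option if pandas is available.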