Reading loosely formatted HTML data tables from SEC filings into pandas dataframe in Python

Question

I am trying to read tables from SEC Edgar filings into pandas dataframe. I initially tried the standard read_html solution from pandas with rather poor results. I Would appreciate on what would be best approaches on reading loosely formatted HTML tables into pandas dataframe. Especially if we could assign multi-index values on for example column headers.

Most of the data can be read quite well using the following example provided by Igniris in this topic but I have some issues with tables which have multi-index in column headers.

Source datais as follows: https://www.sec.gov/ix?doc=/Archives/edgar/data/1820721/000182072122000146/arry-20220930.htm

Example of Table data below.

<table style="border-collapse:collapse;display:inline-table;margin-bottom:5pt;vertical-align:text-bottom;width:100.000%"><tbody><tr><td style="width:1.0%"></td><td style="width:34.455%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:14.594%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:14.594%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:14.594%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:14.598%"></td><td style="width:0.1%"></td></tr><tr><td colspan="3" style="padding:0 1pt"></td><td colspan="9" style="padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:107%">September 30, 2022</span></td><td colspan="3" style="padding:0 1pt"></td><td colspan="9" style="padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:107%">December 31, 2021</span></td></tr><tr><td colspan="3" style="padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Carrying Value</span></td><td colspan="3" style="border-top:1pt solid #000;padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Fair Value</span></td><td colspan="3" style="padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Carrying Value</span></td><td colspan="3" style="border-top:1pt solid #000;padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Fair Value</span></td></tr><tr><td colspan="3" style="background-color:#cceeff;padding:2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%">Convertible Notes</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><span><ix:nonfraction unitref="usd" contextref="ibfb2454929c642babce2aee0f23239d5_I20220930" decimals="-3" name="us-gaap:LongTermDebt" format="ixt:num-dot-decimal" scale="3" id="fact-identifier-928" inside-table="true" data-original-id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV84Mi9mcmFnOjc5YTA2NmI4YTc3NDRhMjQ5ZTdiNDE2MGQ2YTM5MzhjL3RhYmxlOjBhZGRlYTc2ZmVlZjQ0ZDZiYjAyZGNkYzVhZjA2Y2Y0L3RhYmxlcmFuZ2U6MGFkZGVhNzZmZWVmNDRkNmJiMDJkY2RjNWFmMDZjZjRfMi0xLTEtMS04NTQ3Nw_15096ad6-16c9-4d9e-98f5-8e0dbe88b526" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isamountsonly="true" istextonly="false" iscalculationsonly="false" isnegativesonly="false" isadditionalitemsonly="false" isstandardonly="true" iscustomonly="false">413,279</ix:nonfraction></span>&nbsp;</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><span><ix:nonfraction unitref="usd" contextref="ibfb2454929c642babce2aee0f23239d5_I20220930" decimals="-3" name="us-gaap:LongTermDebtFairValue" format="ixt:num-dot-decimal" scale="3" id="fact-identifier-929" inside-table="true" data-original-id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV84Mi9mcmFnOjc5YTA2NmI4YTc3NDRhMjQ5ZTdiNDE2MGQ2YTM5MzhjL3RhYmxlOjBhZGRlYTc2ZmVlZjQ0ZDZiYjAyZGNkYzVhZjA2Y2Y0L3RhYmxlcmFuZ2U6MGFkZGVhNzZmZWVmNDRkNmJiMDJkY2RjNWFmMDZjZjRfMi0zLTEtMS04NTQ3Nw_c4cea1e2-a93c-445b-946e-b038220e473b" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isamountsonly="true" istextonly="false" iscalculationsonly="false" isnegativesonly="false" isadditionalitemsonly="false" isstandardonly="true" iscustomonly="false">381,625</ix:nonfraction></span>&nbsp;</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><span><ix:nonfraction unitref="usd" contextref="i78e7a46ec35641a19e71be9161d7aacb_I20211231" decimals="-3" name="us-gaap:LongTermDebt" format="ixt:num-dot-decimal" scale="3" id="fact-identifier-930" inside-table="true" data-original-id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV84Mi9mcmFnOjc5YTA2NmI4YTc3NDRhMjQ5ZTdiNDE2MGQ2YTM5MzhjL3RhYmxlOjBhZGRlYTc2ZmVlZjQ0ZDZiYjAyZGNkYzVhZjA2Y2Y0L3RhYmxlcmFuZ2U6MGFkZGVhNzZmZWVmNDRkNmJiMDJkY2RjNWFmMDZjZjRfMi01LTEtMS04NTQ3Nw_8842c34d-3e3b-4c21-bf58-073b3c354da8" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isamountsonly="true" istextonly="false" iscalculationsonly="false" isnegativesonly="false" isadditionalitemsonly="false" isstandardonly="true" iscustomonly="false">411,863</ix:nonfraction></span>&nbsp;</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:'Arial',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><span><ix:nonfraction unitref="usd" contextref="i78e7a46ec35641a19e71be9161d7aacb_I20211231" decimals="-3" name="us-gaap:LongTermDebtFairValue" format="ixt:num-dot-decimal" scale="3" id="fact-identifier-931" inside-table="true" data-original-id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV84Mi9mcmFnOjc5YTA2NmI4YTc3NDRhMjQ5ZTdiNDE2MGQ2YTM5MzhjL3RhYmxlOjBhZGRlYTc2ZmVlZjQ0ZDZiYjAyZGNkYzVhZjA2Y2Y0L3RhYmxlcmFuZ2U6MGFkZGVhNzZmZWVmNDRkNmJiMDJkY2RjNWFmMDZjZjRfMi03LTEtMS04NTQ3Nw_ca7a5221-48cc-4b6e-b8cd-064514d8262e" continued-taxonomy="false" enabled-taxonomy="true" highlight-taxonomy="false" selected-taxonomy="false" hover-taxonomy="false" onclick="Taxonomies.clickEvent(event, this)" onkeyup="Taxonomies.clickEvent(event, this)" onmouseenter="Taxonomies.enterElement(event, this);" onmouseleave="Taxonomies.leaveElement(event, this);" tabindex="18" isamountsonly="true" istextonly="false" iscalculationsonly="false" isnegativesonly="false" isadditionalitemsonly="false" isstandardonly="true" iscustomonly="false">410,771</ix:nonfraction></span>&nbsp;</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td></tr><tr><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td></tr><tr><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td></tr></tbody></table>

Sample of the data table in code snippet:

Sample data table with three levels of column hierarchy

The best method so far was the solution by Igniris. At first I read the data into Python and then apply solution used by Igniris into data.

import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
import re
import html_table_extractor
import pandas as pd

### Reading the sample table into a file ###

text_file = open("test_data.html", "r", encoding="utf8")

#read whole file to a string
data = text_file.read()

#close file
text_file.close()

### Changing name of the "data" to resp ####
resp = data

### Below is the solution provided by Igniris ####

Extractor(resp)
ex = Extractor(resp)
ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)

ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)

## we need to consolidate some columns

# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]

# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = [] # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1]==el:   # if value already exist, skip
                continue
            joined.append(el)   # add unique value to cell
        res.append(''.join(joined))   # add cell to column
    storage[k] = res   # add column to storage
df = pd.DataFrame(storage)
print(df)

By using the standard solution we get the following results on the two level column hierarchy carrying value and fair value columns are concatenated for both time dimensions on level 1 column names.

By using the standard solution above on the test data table we get the following results on three level column hierarchy.

By changing the value on "names_line" we can define on which rows contain the column names. If we change it to either 1 or 3 we get the following results.

By using "names_line = 3" we capture most of the data but some columns are concatenated. Therefore the preferred solution would be to be able to use multiple column indexes if possible depending on the table which is read.

Here is another example with little bit more simple structure.

With "names_line = 1" to get the first level of columns

With the "names_line = 2" to format the data using columns from second row.

For most of the cases number of columns can be most likely determined by checking out how many empty rows there are on first column. Though as SEC filings have quite a bit different formats this might not always work but would most likely allow reading most of them to pandas.

My approach would be to first

determine how many column header rows there are
make multi-index out of them to get cleaner structure for the data

For example one solution might be just checking out how many NaN row's there are on first column

test_html = pd.read_html("test_data.html")
test_html[0].isna().cummin().sum()[0]

I also did try to modify the original code to add delimiters by changing the following line by adding | as separator.

res.append('|'.join(joined))   # add cell to column

But table structure should be most likely cleaned further to avoid separate columns for example currency symbols such as $.

But honestly this is where I cant get much forward. Therefore any modifications to this code for the approach mentioned above or easier solutions would be appreciated.

Have you looked into this `XBRL` format you can save the file in? It looks to be some modified form of XML. First time I've heard of it - but it seems like there some parsers for it which may be easier than dealing with the raw html. — jqurious, Mar 12 '23 at 18:37

Reading loosely formatted HTML data tables from SEC filings into pandas dataframe in Python

0 Answers0