Repairing broken html table extracted with BS4 in python

Question

I am parsing html tables from administrative filings. It is tricky as the html is often broken and this results in poorly constructed tables. Here is an example of table that I load into a pandas dataframe:

                0   1    2     3   4         5  \
0             NaN NaN  NaN   NaN NaN       NaN   
1            Name NaN  Age   NaN NaN  Position   
2    Aylwin Lewis NaN  NaN  59.0 NaN       NaN   
3    John Morlock NaN  NaN  58.0 NaN       NaN   
4  Matthew Revord NaN  NaN  50.0 NaN       NaN   
5  Charles Talbot NaN  NaN  48.0 NaN       NaN   
6      Nancy Turk NaN  NaN  49.0 NaN       NaN   
7      Anne Ewing NaN  NaN  49.0 NaN       NaN   

                                                   6  
0                                                NaN  
1                                                NaN  
2    Chairman, Chief Executive Officer and President  
3    Senior Vice President, Chief Operations Officer  
4  Senior Vice President, Chief Legal Officer, Ge...  
5  Senior Vice President and Chief Financial Officer  
6  Senior Vice President, Chief People Officer an...  
7        Senior Vice President, New Shop Development

I wrote the following python code to try to repair the table:

#dropping empty rows
df = df.dropna(how='all',axis=0)

#dropping columns with more than 70% empty values
df = df.dropna(thresh =2, axis=1)

#resetting dataframe index
df = df.reset_index(drop = True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0: 

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row",row.Index," as header.")

            #changing column names
            df.columns = df.iloc[row.Index]

            #dropping first rows
            df = df.iloc[row.Index + 1 :]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop = True)
            print("Attempted to clean dataframe:")
            print(df)

And this is the table i get:

0            Name   NaN                                                NaN
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development

My main problem here is that the headers "Age" and "Position" have disappeared because they were misaligned with their columns. I am using this script to parse many tables, so I can't manually repair them. What could I do to repair the data at this point?

Stef · Accepted Answer · 2019-08-11T16:27:23.570

Don't drop nearly empty columns at the beginning, we need them later: once the header row containing "Name" is found, we collect all its non-empty elements to set them as column headers after dropping empty columns in the remaining data.

#dropping empty rows
df = df.dropna(how='all',axis=0)

#resetting dataframe index
df = df.reset_index(drop = True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0: 

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row",row.Index," as header.")

            #collect column names
            headers = [c for c in row if not pd.isnull(c)][1:]

            #dropping first rows
            df = df.iloc[row.Index + 1 :]

            #dropping empty columns
            df = df.dropna(axis=1)

            #setting column names
            df.columns = (headers + ['col'] * (len(df.columns) - len(headers)))[:len(df.columns)]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop = True)
            print("Attempted to clean dataframe:")
            print(df)

Result:

             Name   Age                                           Position
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development

Thank you so much for the help! I get the following error: ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements. What should I do? — user1029296, Aug 11 '19 at 14:27
You get this error if there are fewer non-empty entries in the found header row that there are columns in your remaining data set. In this case you can fill up the column name list with dummy names, e.g. `df.columns = headers + ['col'] * (len(df.columns) - len(headers))` instead of `df.columns = headers`. — Stef, Aug 11 '19 at 16:22
@user1029296: I updated my answer to cope with cases where there are too few or too many column headers found. — Stef, Aug 11 '19 at 16:28

Repairing broken html table extracted with BS4 in python

1 Answers1