It is always better to clean the original data, because any later processing might introduce artifacts. Your HTML table is built with spanned cells (colspan/rowspan), which is why it is impossible to extract the data in a generic way if you only clean the DataFrame after HTML parsing. So I suggest you install a small module intended exactly for this: extracting data out of HTML tables. Run in your command line
pip install html-table-extractor
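To see why the consolidation step further down is needed: the extractor fills spanned cells by repeating the value in every position the span covers. A minimal sketch of that behavior, using a made-up one-row table (the sample HTML and variable names here are just for illustration):

from html_table_extractor.extractor import Extractor

# hypothetical one-row table where the first cell spans two columns
sample = '<table><tr><td colspan="2">a</td><td>b</td></tr></table>'
ex = Extractor(sample)
ex.parse()
print(ex.return_list())  # should print [['a', 'a', 'b']] -- the spanned value is repeated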
After this, get the raw HTML of the page (you will also need requests),
process the table and consolidate the duplicated entries:
import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor
pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')
# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]
# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()
# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)
## we need to consolidate some columns
# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]
# now consolidate repetitive values
storage = OrderedDict()  # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = []  # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1] == el:  # if the value is already there, skip it
                continue
            joined.append(el)  # add unique value to cell
        res.append(''.join(joined))  # add cell to column
    storage[k] = res  # add column to storage
df = pd.DataFrame(storage)
print(df)
This will produce the following result, which is very close to the original:
Q1`17 Q2`17 Q3`17 Q4`17 FY 2017 Q1`18
0 (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands)
1 (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited)
2 Customer metrics
3 Customer accounts (1) 57,000+ 61,000+ 65,000+ 70,000+ 70,000+ 74,000+
4 Customer accounts added in period (1) 3,300+ 4,000+ 4,100+ 4,700+ 16,100+ 3,900+
5 Deals greater than $100,000 (2) 294 372 337 590 1,593 301
6 Customer accounts that purchased greater than $1 million during the quarter (1,2) 10 15 13 27 13
7
8 Annual recurring revenue metrics
9 Total annual recurring revenue (3) $439,001 $483,578 $526,211 $596,244 $596,244 $641,946
10 Subscription annual recurring revenue (4) $71,950 $103,538 $139,210 $195,488 $195,488 $237,533
11
12 Geographic revenue metrics - ASC 606
13 United States and Canada — — — — — $167,799
14 International — — — — — $78,408
.. ... ... ... ... ... ... ...
23
24 Additional revenue metrics - ASC 606
25 Remaining performance obligations (5) — — — — $99,580 $114,523
26
27 Additional revenue metrics - ASC 605
28 Ratable revenue as % of total revenue (6) 54% 56% 63% 60% 59% 72%
29 Ratable license revenue as % of total license revenue (7) 19% 23% 34% 34% 28% 54%
30 Services revenues as a % of maintenance and services revenue (8) 12% 13% 12% 13% 13% 11%
31
32 Bookings metrics - ASC 605
33 Ratable bookings as % of total bookings (2) 55% 61% 65% 70% 64% 72%
34 Ratable license bookings as % of total license bookings (2) 26% 37% 45% 51% 41% 59%
35
36 Other metrics
37 Worldwide employees 3,193 3,305 3,418 3,489 3,489 3,663
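If you also need the figures as numbers (the cells are still strings like '$439,001', '54%' or '3,300+'), one possible follow-up is to strip the symbols and let pandas coerce the rest. A rough sketch for a single column, using the Q1`17 column from the output above:

# strip $ , % + and coerce what remains; non-numeric rows become NaN
cleaned = df['Q1`17'].str.replace(r'[$,%+]', '', regex=True)
numbers = pd.to_numeric(cleaned, errors='coerce')
print(numbers)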