import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial

async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # pick the download link whose text contains "Table"
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        async with client.stream('GET', tfile) as r:
            # derive the filename from the Content-Disposition header
            fname = r.headers.get('content-disposition').split('=')[-1]
            # stream the download to disk chunk by chunk
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)
        # read_excel is blocking, so run it in a worker thread
        df = await trio.to_thread.run_sync(partial(pd.read_excel, fname, sheet_name=3, engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
Output:
Country County Basin DrillFor ... Week RigCount State/Province PublishDate
0 UNITED STATES SABINE Haynesville Gas ... 13 1 LOUISIANA 40634
1 UNITED STATES TERREBONNE Other Oil ... 13 1 LOUISIANA 40634
2 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
3 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
4 UNITED STATES EDDY Permian Oil ... 13 1 NEW MEXICO 40634
... ... ... ... ... ... ... ... ... ...
769390 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769391 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769392 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769393 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769394 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
[769395 rows x 13 columns]
> Note: it seems you've hit a bug in the `pyxlsb` reader. Selecting the sheet by index is what triggers it, but using `sheet_name='Master Data'` works fine.
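In other words, selecting the sheet by name sidesteps the issue; a minimal sketch, assuming `fname` points at the workbook downloaded by the code above:

# select the sheet by name instead of by index
df = pd.read_excel(fname, sheet_name='Master Data', engine="pyxlsb")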
Update:

The problem is that the Excel file has 2 hidden sheets, and the 2nd sheet really does have 1457 rows. The Master Data is actually the 4th sheet, so `sheet_name=3` will work.
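To verify this yourself, you can enumerate the workbook's sheets, hidden ones included, via `pd.ExcelFile`; a minimal sketch, again assuming the file was saved as `fname`:

import pandas as pd

# ExcelFile lists every sheet, including hidden ones
xl = pd.ExcelFile(fname, engine="pyxlsb")
print(xl.sheet_names)  # 'Master Data' should show up at index 3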
Last update:

In order to follow the Python DRY principle, I noticed that we don't need to save the file locally, or even create an in-memory file object and load that into pandas. The response content is already stored in memory, so we can load it all at once by passing `r.content` directly to pandas!

Use the code below:
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial

async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # pick the download link whose text contains "Table"
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # download the workbook straight into memory
        r = await client.get(tfile)
        # pass the raw bytes to pandas; read_excel is blocking, so run it in a thread
        df = await trio.to_thread.run_sync(partial(pd.read_excel, r.content, sheet_name=3, engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
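Note that `read_excel` accepts a file-like object as well, so if your pandas version warns about passing raw bytes directly, wrapping the content in a `BytesIO` works the same way; a minimal sketch:

from io import BytesIO

# wrap the raw bytes in an in-memory file object for read_excel
df = await trio.to_thread.run_sync(partial(pd.read_excel, BytesIO(r.content), sheet_name=3, engine="pyxlsb"))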