1

I am using kaggle browser. Looking to see if all the below can be done on this kaggle notebook.

Website url: click here

Website screenshot:

enter image description here

The downloading files here in the website are updated every hour and daily. I don't think any information on this website going to change except the xlsx file content as you see in the website.

I want to download two things from this url: meta information and xlsx files you see in the screenshot.

First, I want to download this meta information and make it a dataframe as given below. Now I am manually selecting them, copying them here. But I want to do it from the url

url_meta_df = 

ID   Type   Name        URL
CAL  Region California  https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx
CAR  Region Carolinas   https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx
CENT Region Central     https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx
FLA  Region Florida     https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx

Second: download each xlsx file, save them.

My code: I have tried following based on an answer here in SO

from bs4 import BeautifulSoup

import requests
r  = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

for link in soup.find_all('a'):
    print(link.get('href'))

Present output:

None
https://twitter.com/EIAgov
None
https://www.facebook.com/eiagov
None
#page-sub-nav
/
#
/petroleum/
/petroleum/weekly/
/petroleum/supply/weekly/
/naturalgas/
http://ir.eia.gov/ngs/ngs.html
/naturalgas/weekly/
/electricity/
/electricity/monthly/
....
baduker
  • 19,152
  • 9
  • 33
  • 56
Mainland
  • 4,110
  • 3
  • 25
  • 56

2 Answers2

4

This should get you all the files.

NOTE: This might take a while as the files are big - 20MB+ each. The total size is over 3GB

The meta info you need is in the get_data(). It's a list of dictionaries.

import os
import random
import time
from pathlib import Path
from shutil import copyfileobj
from typing import List, Dict

import pandas as pd
import requests
from tabulate import tabulate

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203",
}

API_URL = "https://www.eia.gov/electricity/930-api//respondents/data"
FILE_BASE_URL = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls"
DOWNLOAD_DIR = Path("xls_files")

PARAMS = {
    "type[0]": "BA",
    "type[1]": "BR",
    "type[2]": "ER",
}


def wait_a_bit() -> None:
    pseudo_random_wait = random.randint(1, 5)
    print(f"Waiting {pseudo_random_wait} seconds...")
    time.sleep(pseudo_random_wait)


def build_xls_links(data: List[dict]) -> Dict[str, str]:
    links = {}
    for item in data:
        if item["type"] == "BA" and item["time_zone"] is not None:
            links[item["id"]] = f"{FILE_BASE_URL}/{item['id']}.xlsx"

        elif item["type"] == "ER":
            links[item["id"]] = f"{FILE_BASE_URL}/Region_{item['id']}.xlsx"
    return links


def build_metadata_table(data: List[dict], links: Dict[str, str]) -> List[List[str]]:
    metadata_table = []
    for item in data:
        metadata_table.append(
            [
                item["id"],
                item["name"],
                item["type"] if item["type"] in ["BA", "ER"] else "Unknown",
                links.get(item["id"], "N/A")
            ]
        )
    return metadata_table


def get_data(connection: requests.Session) -> List[dict]:
    return connection.get(API_URL, params=PARAMS).json()[0]["data"]


def create_and_save_table(data: List[Dict], xls_links: Dict[str, str]) -> None:
    # Do some funky stuff to get the data into a nice table
    metadata_table = build_metadata_table(data, xls_links)
    df = pd.DataFrame(metadata_table, columns=["ID", "Name", "Type", "Link"])
    df.drop(df[df["Link"] == "N/A"].index, inplace=True)
    df.drop(df[df["Type"] == "Unknown"].index, inplace=True)
    df["Type"] = df["Type"].map({"ER": "Region", "BA": "BA"})
    df.sort_values(
        by=["Type", "ID"],
        inplace=True,
        ascending=[False, True],
    )
    df.to_csv(DOWNLOAD_DIR / "metadata.csv", index=False)
    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))


def fetch_files(connection: requests.Session, xls_links: Dict[str, str]) -> None:
    total_files = len(xls_links)
    for count, file_link in enumerate(xls_links.values(), start=1):
        file_name = file_link.split("/")[-1]
        print(f"Downloading file {count} of {total_files}: {file_name}")
        file_response = connection.get(file_link, stream=True)
        with open(DOWNLOAD_DIR / file_name, "wb") as f:
            copyfileobj(file_response.raw, f)
        wait_a_bit()


def downloader() -> None:
    with requests.Session() as connection:
        connection.headers.update(headers)

        os.makedirs(DOWNLOAD_DIR, exist_ok=True)

        data = get_data(connection)
        xls_links = build_xls_links(data)

        create_and_save_table(data, xls_links)

        print("Would you like to download the files? (y/n)")
        user_input = input(">>> ")
        if user_input.lower() != "y":
            exit()

        fetch_files(connection, xls_links)


if __name__ == "__main__":
    downloader()

You should see this:

+------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------+
| ID   | Name                                                               | Type   | Link                                                                         |
|------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------|
| CAL  | California                                                         | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAL.xlsx  |
| CAN  | Canada                                                             | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAN.xlsx  |
| CAR  | Carolinas                                                          | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CAR.xlsx  |
| CENT | Central                                                            | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_CENT.xlsx |
| FLA  | Florida                                                            | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_FLA.xlsx  |
| MEX  | Mexico                                                             | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MEX.xlsx  |
| MIDA | Mid-Atlantic                                                       | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MIDA.xlsx |
| MIDW | Midwest                                                            | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_MIDW.xlsx |
| NA   | Outside Lower 48                                                   | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_NA.xlsx   |
| NW   | Northwest                                                          | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_NW.xlsx   |
| SE   | Southeast                                                          | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_SE.xlsx   |
| SW   | Southwest                                                          | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_SW.xlsx   |
| US48 | United States Lower 48                                             | Region | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/Region_US48.xlsx |
| AEC  | PowerSouth Energy Cooperative                                      | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AEC.xlsx         |
| AECI | Associated Electric Cooperative, Inc.                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AECI.xlsx        |
| AVA  | Avista Corporation                                                 | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AVA.xlsx         |
| AVRN | Avangrid Renewables, LLC                                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AVRN.xlsx        |
| AZPS | Arizona Public Service Company                                     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/AZPS.xlsx        |
| BANC | Balancing Authority of Northern California                         | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/BANC.xlsx        |
| BPAT | Bonneville Power Administration                                    | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/BPAT.xlsx        |
| CHPD | Public Utility District No. 1 of Chelan County                     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CHPD.xlsx        |
| CISO | California Independent System Operator                             | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CISO.xlsx        |
| CPLE | Duke Energy Progress East                                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CPLE.xlsx        |
| CPLW | Duke Energy Progress West                                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/CPLW.xlsx        |
| DEAA | Arlington Valley, LLC                                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DEAA.xlsx        |
| DOPD | PUD No. 1 of Douglas County                                        | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DOPD.xlsx        |
| DUK  | Duke Energy Carolinas                                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/DUK.xlsx         |
| EEI  | Electric Energy, Inc.                                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/EEI.xlsx         |
| EPE  | El Paso Electric Company                                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/EPE.xlsx         |
| ERCO | Electric Reliability Council of Texas, Inc.                        | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/ERCO.xlsx        |
| FMPP | Florida Municipal Power Pool                                       | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FMPP.xlsx        |
| FPC  | Duke Energy Florida, Inc.                                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FPC.xlsx         |
| FPL  | Florida Power & Light Co.                                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/FPL.xlsx         |
| GCPD | Public Utility District No. 2 of Grant County, Washington          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GCPD.xlsx        |
| GLHB | GridLiance                                                         | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GLHB.xlsx        |
| GRID | Gridforce Energy Management, LLC                                   | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRID.xlsx        |
| GRIF | Griffith Energy, LLC                                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRIF.xlsx        |
| GRMA | Gila River Power, LLC                                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GRMA.xlsx        |
| GVL  | Gainesville Regional Utilities                                     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GVL.xlsx         |
| GWA  | NaturEner Power Watch, LLC                                         | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/GWA.xlsx         |
| HGMA | New Harquahala Generating Company, LLC                             | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/HGMA.xlsx        |
| HST  | City of Homestead                                                  | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/HST.xlsx         |
| IID  | Imperial Irrigation District                                       | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/IID.xlsx         |
| IPCO | Idaho Power Company                                                | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/IPCO.xlsx        |
| ISNE | ISO New England                                                    | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/ISNE.xlsx        |
| JEA  | JEA                                                                | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/JEA.xlsx         |
| LDWP | Los Angeles Department of Water and Power                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/LDWP.xlsx        |
| LGEE | Louisville Gas and Electric Company and Kentucky Utilities Company | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/LGEE.xlsx        |
| MISO | Midcontinent Independent System Operator, Inc.                     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/MISO.xlsx        |
| NEVP | Nevada Power Company                                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NEVP.xlsx        |
| NSB  | Utilities Commission of New Smyrna Beach                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NSB.xlsx         |
| NWMT | NorthWestern Corporation                                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NWMT.xlsx        |
| NYIS | New York Independent System Operator                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/NYIS.xlsx        |
| OVEC | Ohio Valley Electric Corporation                                   | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/OVEC.xlsx        |
| PACE | PacifiCorp East                                                    | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PACE.xlsx        |
| PACW | PacifiCorp West                                                    | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PACW.xlsx        |
| PGE  | Portland General Electric Company                                  | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PGE.xlsx         |
| PJM  | PJM Interconnection, LLC                                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PJM.xlsx         |
| PNM  | Public Service Company of New Mexico                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PNM.xlsx         |
| PSCO | Public Service Company of Colorado                                 | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PSCO.xlsx        |
| PSEI | Puget Sound Energy, Inc.                                           | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PSEI.xlsx        |
| SC   | South Carolina Public Service Authority                            | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SC.xlsx          |
| SCEG | Dominion Energy South Carolina, Inc.                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SCEG.xlsx        |
| SCL  | Seattle City Light                                                 | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SCL.xlsx         |
| SEC  | Seminole Electric Cooperative                                      | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SEC.xlsx         |
| SEPA | Southeastern Power Administration                                  | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SEPA.xlsx        |
| SOCO | Southern Company Services, Inc. - Trans                            | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SOCO.xlsx        |
| SPA  | Southwestern Power Administration                                  | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SPA.xlsx         |
| SRP  | Salt River Project Agricultural Improvement and Power District     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SRP.xlsx         |
| SWPP | Southwest Power Pool                                               | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/SWPP.xlsx        |
| TAL  | City of Tallahassee                                                | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TAL.xlsx         |
| TEC  | Tampa Electric Company                                             | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TEC.xlsx         |
| TEPC | Tucson Electric Power                                              | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TEPC.xlsx        |
| TIDC | Turlock Irrigation District                                        | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TIDC.xlsx        |
| TPWR | City of Tacoma, Department of Public Utilities, Light Division     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TPWR.xlsx        |
| TVA  | Tennessee Valley Authority                                         | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/TVA.xlsx         |
| WACM | Western Area Power Administration - Rocky Mountain Region          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WACM.xlsx        |
| WALC | Western Area Power Administration - Desert Southwest Region        | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WALC.xlsx        |
| WAUW | Western Area Power Administration - Upper Great Plains West        | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WAUW.xlsx        |
| WWA  | NaturEner Wind Watch, LLC                                          | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/WWA.xlsx         |
| YAD  | Alcoa Power Generating, Inc. - Yadkin Division                     | BA     | https://www.eia.gov/electricity/gridmonitor/knownissues/xls/YAD.xlsx         |
+------+--------------------------------------------------------------------+--------+------------------------------------------------------------------------------+

And if you type y after the prompt:

Would you like to download the files? (y/n)

You'll get this:

Downloading file 1 of 81: Region_CENT.xlsx
Waiting 5 seconds...
Downloading file 2 of 81: AEC.xlsx
Waiting 4 seconds...
Downloading file 3 of 81: AECI.xlsx
Waiting 5 seconds...
Downloading file 4 of 81: AVA.xlsx
Waiting 4 seconds...
Downloading file 5 of 81: AZPS.xlsx
Waiting 2 seconds...
Downloading file 6 of 81: BANC.xlsx
Waiting 2 seconds...
Downloading file 7 of 81: BPAT.xlsx
Waiting 1 seconds...

and more...

However, if you don't want to wait, you can get the entire dump here.

baduker
  • 19,152
  • 9
  • 33
  • 56
  • 1
    Simply superb. I would have not done it by myself. I just thought of downloading each excel file, copying each data entry manually. But your solution showed that many things are possible. Thank you very much for writing the solution using functions. Now I can go one by one, learn myself new things today. There is so much to learn in your solution. – Mainland Sep 01 '23 at 21:25
  • I have two questions here: 1. The solution you presented is an API based or web scrapping? 2. As I said, these xlsx files update daily. Suppose, after a week, if I want to download last one week data and append it to existing data. Do you think such things is possible? or , Do I have download all data from beginning every time I want to download? – Mainland Sep 01 '23 at 21:25
  • In addition to my two questions above, 3. How and where did you manage to get the information about `API_URL = "https://www.eia.gov/electricity/930-api//respondents/data" FILE_BASE_URL = "https://www.eia.gov/electricity/gridmonitor/knownissues/xls" DOWNLOAD_DIR = Path("xls_files")` Because the url I gave you is this one `https://www.eia.gov/electricity/gridmonitor/dashboard/electric_overview/US48/US48` and this url was not used. I am highly curious to know how you were able to retrieve this information? How do you know that this specific URL is needed to download the data I requested. – Mainland Sep 01 '23 at 21:41
  • 1. Yes. it's still web scraping. 2. How do you know the files change? If the new files are published under the same name, you'd have to download all of them again or come up with a way to check for changes. 3. I got all the endpoints from the Network tab in Developer's tools in my web browser and then I did some reverse engineering and educated guessing. – baduker Sep 02 '23 at 09:14
-1

The main problem with what you are trying to do is that the table with the *.xlsx is rendered by Javascript after you press the "Download Data" button, so to actually access the table data you would first have to use something like the Python webbrowser module to click the button and generate the HTML of the table.

But there is a quicker and easier way to do this: the reference table is in fact static - the URLs are fixed. Therefore, you dont really need to access the table if you want to download the xlsx files. You can get the HTML code of the page rendered by Javascript by following this answer to get the metadata table, paste it to an HTML file and save it. After that, you can use the following code to create the pandas df and download the xlsx files.

from bs4 import BeautifulSoup
import pandas as pd
import requests

html_file = #PATH TO HTML FILE

with open(html_file) as f:
    soup = BeautifulSoup(f.read())

# there is more than one table - it's the last one
table = soup.find_all('table')[-1]

# use pandas to create df from table
df = pd.read_html(str(table))[0]

# loop for each url in the df column
for xlsx_url in df['URL']:

    # get the xlsx file
    resp = requests.get(xlsx_url)

    # get the filename to save
    filename = xlsx_url.split('/')[-1]

    #save it
    with open(filename, 'wb') as f:
        f.write(resp.content)
  • This is not what the OP wants. Your answer doesn't download anything and is incomplete. What's `PATH TO HTML FILE`? – baduker Sep 01 '23 at 19:33
  • Basically, OP wants to download the xlsx files. To get the links, they need html code rendered by javascript, and they can't do it with requests. Since the links to the tables are static, they can get the javascript rendered html using the linked answer, and then download the files using the code I provided. – Lucas Moura Gomes Sep 01 '23 at 19:49
  • Yes, they can. See my answer above. – baduker Sep 01 '23 at 19:55
  • Requests cannot run javascript. Your code works because of the API provided by the website. – Lucas Moura Gomes Sep 01 '23 at 20:00
  • I know that, but that's how the website works anyway. It takes the data from the API and then generates the file links. The OP wants to download all the xlsx files and that's exactly what they get. – baduker Sep 01 '23 at 20:02
  • Congratulations for downloading the xlsx files. But I would rather get an answer that explains what I'm doing wrong and solves the problem with a simple solution instead. I could download the sheets with the code I provided anyway. – Lucas Moura Gomes Sep 01 '23 at 20:08
  • 1
    Your answer might be pushing the OP in the right direction but as it stands now it is incomplete and simply doesn't work. The link to the other answer is useful but requires a few extra manual steps that you don't cover in your code. – baduker Sep 01 '23 at 20:23