I'm currently working with a function that pulls corporate filings via a library that scrapes the SEC EDGAR database. I'm trying to build a dataset of a few hundred companies by calling that function in a loop. Sometimes I make it through a hundred names, other times only a few dozen, before the loop fails with the error below. I believe this is the result of hitting the server too often?
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sec.gov', port=443): Max retries exceeded with url: /Archives/edgar/data/899051/000089905119000007/0000899051-19-000007-index.htm (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7afd9c91d110>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
Does anyone have any tips or suggestions on how to add a sleep function and/or proxy rotation to avoid getting throttled? If so, what is the best practice for incorporating it: inside the function that scrapes the database, or inside the loop that calls the function each time?
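For reference, here's a minimal sketch of the kind of retry/backoff wrapper I have in mind; fetch_filings and tickers are placeholders standing in for my actual scraping function and company list:

import time
import requests

def fetch_with_retry(fetch_fn, *args, max_retries=5, base_delay=2.0, **kwargs):
    # Retry on connection errors, doubling the wait each time (2s, 4s, 8s, ...)
    for attempt in range(max_retries):
        try:
            return fetch_fn(*args, **kwargs)
        except requests.exceptions.ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

# hypothetical usage, with a fixed pause between companies:
# for ticker in tickers:
#     filings = fetch_with_retry(fetch_filings, ticker)
#     time.sleep(0.5)

Is a wrapper like this at the loop level reasonable, or should the retry logic live inside the scraping function itself?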