Problem: An unexplained ValueError("No tables found") is being raised intermittently when using pandas read_html in conjunction with a proxy configuration to parse data from multiple webpages (Python 3.x).
Background: To access each webpage, http_url is used as the target address. After each iteration of the loop, the {team} parameter in http_url is updated to access the next webpage (32 total pages, all on the same host domain):
for team in teams:
    http_url = f"http://www.footballguys.com/stats/game-logs-against/teams?team={team}&year={season_year}"
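For reference, the URL construction above can be sketched as a small standalone helper. This is only an illustration: the helper name build_url, the team codes, and the season year are placeholders that do not come from my actual program.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.footballguys.com/stats/game-logs-against/teams"


def build_url(team, season_year):
    """Return the game-log URL for one team and season."""
    # urlencode escapes any characters that are not URL-safe.
    query = urlencode({"team": team, "year": season_year})
    return f"{BASE_URL}?{query}"


# Placeholder inputs; the real list holds all 32 teams.
teams = ["ARI", "ATL"]
season_year = 2023

for team in teams:
    http_url = build_url(team, season_year)
    print(http_url)
```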
Problem Description: The target data from each webpage (http_url) is retrieved/parsed into a list of pandas DataFrames using the read_html method in one of two ways:
- Without Using a Proxy – The HTML is parsed directly from each webpage:
dataframe_list = pd.read_html(http_url)
Successful: This method always returns the list of DataFrames from each webpage, and the loop completes after returning data from all 32 webpages.
- Using a Proxy: The HTML is parsed from the returned unicode GET response, converted to a string/file-like object using io.StringIO:
proxies = {
    "http": "http://{}:{}@{}:{}".format(proxy_user, proxy_pass, proxy_host, proxy_port)
}
The proxies dict is built by combining proxy_user, proxy_pass, proxy_host, and proxy_port (input strings for each proxy parameter of the same name) into a single proxy URL, and it is passed to the proxies= argument in each GET request.
source = requests.get(http_url, proxies=proxies, verify=False).text
dataframe_list = pd.read_html(io.StringIO(source))
Frequently Unsuccessful: This method frequently, and without explanation, raises ValueError("No tables found"). The error may be raised after the 2nd GET request or the 29th; there is seemingly no pattern.
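The proxy setup described above can be sketched as a small helper. The name build_proxies and the sample credentials are hypothetical, and percent-encoding the username and password is an added precaution (in case they contain characters such as @ or :) rather than something my original code does.

```python
from urllib.parse import quote


def build_proxies(proxy_user, proxy_pass, proxy_host, proxy_port):
    """Return a requests-style proxies dict for plain-HTTP targets."""
    # Assumption: credentials may contain reserved characters, so they
    # are percent-encoded before being placed in the proxy URL.
    user = quote(proxy_user, safe="")
    password = quote(proxy_pass, safe="")
    return {"http": f"http://{user}:{password}@{proxy_host}:{proxy_port}"}


# Placeholder values, not real credentials.
proxies = build_proxies("user", "p@ss:word", "proxy.example.com", "8080")
print(proxies["http"])
```

The resulting dict has the same shape as the proxies dict above and would be passed to requests.get in the same way.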
Additional Details: The results of five consecutive test runs using Option 2 (the proxy method), including the inspected response details from any failed requests:
| # Successful Requests Before ValueError | Details of Failed Response |
|---|---|
| 2 | MAX_THREADS_REACHED |
| 21 | Request failed from proxy-provider: Request failed. You will not be charged for this request... |
| 13 | http.client.RemoteDisconnected:... During handling of the above exception, another exception occurred: urllib3.exceptions.MaxRetryError:... During handling of the above exception, another exception occurred: requests.exceptions.ProxyError:... Note: Error messages should be posted in full, but I haven't done so as this is not a ValueError. I've noted it in case it would be helpful to see the full error, but I thought it might be excessive to post right away for what looks to be a blocked request (even though being blocked would be strange, since the residential proxy-provider uses custom headers). |
| 14 | MAX_THREADS_REACHED |
| All requests successful | - |
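The failed-response details above were obtained by inspecting the response body when parsing failed. A minimal sketch of that kind of check follows; the helper names looks_like_table_html and summarize_failure are hypothetical, and the substring test for a table tag is a rough heuristic, not how read_html itself decides.

```python
def looks_like_table_html(body):
    """Rough check: does the response body contain a <table> tag at all?"""
    return "<table" in body.lower()


def summarize_failure(status_code, body, limit=200):
    """Return a short, single-line description of a table-less response."""
    # Collapse whitespace so multi-line bodies fit on one log line.
    snippet = " ".join(body.split())[:limit]
    return f"status={status_code}, no <table> in body: {snippet!r}"


# Hypothetical usage inside the loop from the question:
#     response = requests.get(http_url, proxies=proxies, verify=False)
#     if looks_like_table_html(response.text):
#         dataframe_list = pd.read_html(io.StringIO(response.text))
#     else:
#         print(summarize_failure(response.status_code, response.text))
```

Checking the body before calling read_html separates "the proxy returned something other than the page" from "the page has no tables", which the bare ValueError does not distinguish.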
Edit - I found the MAX_THREADS_REACHED response to be the most perplexing, so I reached out to my residential proxy provider. Unfortunately, they confirmed that this response was not being returned from their API. I had hoped they might provide some insight, as I have not been able to find any documentation with those specific response details, and I am stumped as to what could be causing that error. For reference, my program is not multi-threaded, and the proxy allows up to 5 concurrent requests, which I am not exceeding.