Problem: An unexplained ValueError("No tables found") is raised intermittently when using pandas read_html together with a proxy configuration to parse data from multiple webpages (Python 3.x).


Background: To access each webpage, http_url is used as the target address. After each iteration of the loop, the {team} parameter in http_url is updated to access the next webpage (32 total pages – same host domain):

for team in teams:
    http_url = f"http://www.footballguys.com/stats/game-logs-against/teams?team={team}&year={season_year}"

Problem Description: The target data from each webpage (http_url) is retrieved/parsed into a list of pandas DataFrames using the read_html method in one of two ways:

  1. Without Using a Proxy – The HTML is parsed directly from each webpage:
dataframe_list = pd.read_html(http_url)

Successful: This method always successfully returns the list of DataFrames from each webpage – loop completes after returning data from all 32 webpages.

  2. Using a Proxy – The HTML is parsed from the Unicode text of the GET response, wrapped in a file-like object using io.StringIO:
proxies = {
    "http": "http://{}:{}@{}:{}".format(proxy_user, proxy_pass, proxy_host, proxy_port)
}

The proxies dict is built by formatting proxy_user, proxy_pass, proxy_host, and proxy_port (input strings for the proxy parameters of the same name) into a single proxy URL, and is passed to the proxies= argument of each GET request.
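
For reference, a minimal sketch of how the proxies dict could be built so that special characters in the credentials do not break the proxy URL. This assumes the credentials might contain characters such as "@" or ":", which I have not confirmed is the case here; build_proxies and the example values are hypothetical.

```python
from urllib.parse import quote


def build_proxies(proxy_user, proxy_pass, proxy_host, proxy_port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "@" and ":".
    user = quote(proxy_user, safe="")
    password = quote(proxy_pass, safe="")
    # Only the "http" scheme is mapped, matching the http:// target URLs above.
    return {"http": f"http://{user}:{password}@{proxy_host}:{proxy_port}"}
```

Usage would be proxies = build_proxies(proxy_user, proxy_pass, proxy_host, proxy_port), with the result passed to requests.get exactly as before.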

source = requests.get(http_url, proxies=proxies, verify=False).text
dataframe_list = pd.read_html(io.StringIO(source))

Frequently Unsuccessful: This method frequently, and without explanation, raises ValueError("No tables found"). The error may be raised after the 2nd GET request or the 29th; there is seemingly no pattern.
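
Because read_html only sees whatever text the GET request returned, the ValueError can hide what the response actually contained. A minimal sketch of checking the response before parsing is below. The helper names (has_table_markup, summarize_failure, parse_tables) are hypothetical, and read_html is passed in as an argument (for example pd.read_html) purely so the check can be exercised without pandas or a network connection.

```python
import io


def has_table_markup(body):
    """Return True if the response body contains at least one <table> tag."""
    return "<table" in body.lower()


def summarize_failure(status_code, body, limit=200):
    """Build a short, single-line description of a response without tables."""
    snippet = " ".join(body.split())[:limit]
    return f"status={status_code} body={snippet!r}"


def parse_tables(status_code, body, read_html):
    """Parse tables from body, or raise an error that shows the response."""
    if status_code != 200 or not has_table_markup(body):
        raise RuntimeError(
            "No table markup in response: " + summarize_failure(status_code, body)
        )
    # Same call as in the proxy method above, only reached for plausible HTML.
    return read_html(io.StringIO(body))
```

With a requests response object named response, this would be called as parse_tables(response.status_code, response.text, pd.read_html), so a failed request reports its own body instead of "No tables found".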


Additional Details: The results of five consecutive test runs using Option 2 (the proxy method), including the inspected response details of each failed request:

# Successful Requests Before ValueError | Details of Failed Response
2                       | MAX_THREADS_REACHED
21                      | Request failed from proxy-provider:Request failed. You will not be charged for this request...
13                      | http.client.RemoteDisconnected:...
                        | During handling of the above exception, another exception occurred: urllib3.exceptions.MaxRetryError:...
                        | During handling of the above exception, another exception occurred: requests.exceptions.ProxyError:...
                        | Note: Error messages should be posted in full, but I haven't done so as this is not a ValueError. I've noted it in case it would be helpful to see the full error, but I thought it might be excessive to post right away for what looks to be a blocked request (even though being blocked would be strange, since the residential proxy-provider uses custom headers).
14                      | MAX_THREADS_REACHED
All requests successful | -

Edit - I found the MAX_THREADS_REACHED response the most perplexing, so I reached out to my residential proxy-provider. Unfortunately, they confirmed that this response is not being returned from their API. I had hoped they might provide some insight, as I have not been able to find any documentation with those specific response details, and I am stumped as to what could be causing it. For reference, my program is not multi-threaded, and the proxy allows up to 5 concurrent requests, which I am not exceeding.
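
Since the failures listed above differ from run to run, they look transient, so one possible workaround is retrying a failed request with a growing delay. The sketch below is only that, a workaround under the assumption that a repeated request can succeed; it does not explain MAX_THREADS_REACHED. The name fetch_with_retries and its parameters are hypothetical, and fetch and is_valid are supplied by the caller.

```python
import time


def fetch_with_retries(fetch, is_valid, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until is_valid(body) is true, doubling the delay each time."""
    last_body = None
    for attempt in range(attempts):
        try:
            last_body = fetch()
        except Exception as exc:  # e.g. requests.exceptions.ProxyError
            # Keep a description of the error so the final message can show it.
            last_body = f"{type(exc).__name__}: {exc}"
        else:
            if is_valid(last_body):
                return last_body
        if attempt < attempts - 1:
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"No valid response after {attempts} attempts; last: {last_body!r}"
    )
```

In the loop over teams, fetch could be lambda: requests.get(http_url, proxies=proxies, verify=False).text and is_valid could be lambda body: "<table" in body.lower(), with the returned text then wrapped in io.StringIO and passed to pd.read_html as before.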

DAK