
For a while, I have been trying to use Cloud9 to give me an extra hand with repetitive tasks at my job. However, most of those tasks involve web scraping, and I am having a hard time using the urllib library.

It does not work even for simple code. It keeps running forever, and I can't find the reason why. I would appreciate some tips...

from urllib.request import urlopen
import json
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = ('Enter - ')
html = urlopen(url, context=ctx).read()
str_data = html.decode()  # urlopen().read() already returns the response body as bytes
json_data = json.loads(str_data)

for entry in json_data:
    name = entry[0]
    title = entry[1]
    print((name, title))

4 Answers


Without much debug information to go on, one can only assume that the request is timing out, since you aren't passing a timeout to the urlopen function.

To avoid this, pass a reasonable timeout to urlopen so that the call doesn't run indefinitely.
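
For example, a minimal sketch (the URL below is a placeholder, not the endpoint from the question):

from urllib.request import urlopen
from urllib.error import URLError
import socket

url = "https://example.com/data.json"  # placeholder URL

try:
    # Give up after 10 seconds instead of hanging forever
    html = urlopen(url, timeout=10).read()
except (URLError, socket.timeout) as exc:
    print("Request failed or timed out:", exc)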

– alex067

Do yourself a favor and replace urllib with Requests. Your example simplifies to:

import requests

url = "..."
json_data = requests.get(url).json()

for entry in json_data:
    name = entry[0]
    title = entry[1]
    print(name, title)

That also removes a whole awful lot of potential for mistakes from your code.
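
If the endpoint is slow or unreachable, it may also help to pass a timeout here, in line with the other answer. A rough sketch, again with a placeholder URL:

import requests

url = "https://example.com/data.json"  # placeholder URL

try:
    response = requests.get(url, timeout=10)  # seconds
    response.raise_for_status()               # raise on non-2xx status codes
    json_data = response.json()
except requests.exceptions.RequestException as exc:
    print("Request failed:", exc)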

– Kirk Strauser

New code

import requests
import json

url = "-"
json_data = requests.get(url).json()
info = json_data  # .json() already returns parsed data; calling json.loads on it raises TypeError

print('User count:', len(info))
print(info)

Debug Mode:

[IKP3db-g] 15:43:52,971733 - INFO - IKP3db 1.4.1 - Inouk Python Debugger for CPython 3.6+
[IKP3db-g] 15:43:52,972714 - INFO - IKP3db listening on 127.0.0.1:15471
[IKP3db-g] 15:43:53,003016 - INFO - Connected with 127.0.0.1:50936
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connection.py", line 334, in connect
    conn = self._new_conn()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fa2bb9729e8>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='ot.cloud.mapsfinancial.com', port=443): Max retries exceeded with url: /pegasus/api/consulta/caixa/saldos/carteira/DVG1%20FIA/2018-01-08 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa2bb9729e8>: Failed to establish a new connection: [Errno 110] Connection timed out',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ikp3db.py", line 2047, in main
  File "/usr/local/lib/python3.6/dist-packages/ikp3db.py", line 1526, in _runscript
    exec(statement, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/ubuntu/environment/Testing.py", line 5, in <module>
    json_data = requests.get(url).json()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='ot.cloud.mapsfinancial.com', port=443): Max retries exceeded with url: xxxxxxxx (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa2bb9729e8>: Failed to establish a new connection: [Errno 110] Connection timed out',))
[IKP3db-g] 15:46:04,668520 - INFO - Uncaught exception. Entering post mortem debugging

You can use the code block below:

from pathlib import Path
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def url_retrieve(url: str, outfile: Path):
    R = requests.get(url, allow_redirects=True, verify=False)
    if R.status_code != 200:
        raise ConnectionError('could not download {}\nerror code: {}'.format(url, R.status_code))

    outfile.write_bytes(R.content)
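
A hypothetical call, with a placeholder URL and output file just to illustrate usage:

from pathlib import Path

# placeholder URL and filename
url_retrieve("https://example.com/data.json", Path("data.json"))
print(Path("data.json").stat().st_size, "bytes downloaded")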