
So I want to simply read the HTML of a website using

from urllib.request import urlopen
url = 'https://dictionary.cambridge.org/dictionary/english/water'
page = urlopen(url)

For some websites it works, but for some, like in the code above, I get this error:

Traceback (most recent call last):
  File "F:/mohammad Desktop/work spaces/python/Python Turial Release 3.9.1/mod2.py", line 4, in <module>
    page = urlopen(url)
  File "C:\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python\Python38\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "C:\Python\Python38\lib\urllib\request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Python\Python38\lib\urllib\request.py", line 1362, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "C:\Python\Python38\lib\urllib\request.py", line 1323, in do_open
    r = h.getresponse()
  File "C:\Python\Python38\lib\http\client.py", line 1322, in getresponse
    response.begin()
  File "C:\Python\Python38\lib\http\client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "C:\Python\Python38\lib\http\client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

There are some similar questions, but their solutions did not work for me.

mike

1 Answer


I was able to reproduce this behaviour.

It can be fixed by using a `Request` object and changing the request headers to ones more typically sent by a web browser, for example Safari on a Mac:

import urllib.request  # 'import urllib' alone does not load the urllib.request submodule
import requests        # not strictly needed here; requests can also fetch pages on its own

url = 'https://dictionary.cambridge.org/dictionary/english/water'

# Send a browser-style User-Agent so the server does not drop the connection
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2 (KHTML, like Gecko) Version/5.0.6 Safari/533.22.3'})
print(urllib.request.urlopen(req).read())

I would suggest that this is happening because https://dictionary.cambridge.org's web server has been set to block requests whose headers are associated with automated scraping (such as the default `Python-urllib/3.x` User-Agent sent by `urllib.request.urlopen`).
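
If you want to confirm what urllib sends by default, a quick sketch like the one below echoes the request headers back via httpbin.org (any header-echo service would do; this endpoint is just a convenient example):

import json
import urllib.request

# httpbin.org/headers replies with JSON containing the headers it received
with urllib.request.urlopen('https://httpbin.org/headers') as resp:
    echoed = json.loads(resp.read().decode('utf-8'))

print(echoed['headers'].get('User-Agent'))  # something like 'Python-urllib/3.8'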

However, I am not sure about the ethics of intentionally spoofing headers; the default ones may be blocked for a reason...
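
As an aside, the same fetch can be done with the third-party `requests` module (a rough equivalent of the snippet above, not a drop-in for it), passing the same browser-style User-Agent:

import requests

url = 'https://dictionary.cambridge.org/dictionary/english/water'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2 (KHTML, like Gecko) Version/5.0.6 Safari/533.22.3'}

# requests handles the connection and decoding; raise_for_status() surfaces HTTP errors
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
print(resp.text[:500])  # first 500 characters of the decoded HTML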

  • Thank you for the answer. The solution worked for me, but I am not sure about your code snippet. Please edit your code so I can accept your answer. – mike Nov 20 '21 at 18:43
  • @mike What are you not sure about? I'd be happy to clarify anything. – qr7NmUTjF6vbA4n8V3J9 Nov 20 '21 at 18:45
  • Why have you imported request**s** (at line 2)? Isn't that a typo or mistake? – mike Nov 20 '21 at 19:06
  • @mike It's not a typo, if you don't import it you get an `AttributeError` since `Request` is part of `requests` not `urllib`. You can actually get an HTML file using just the `requests` module, but since in your question you were using `urllib` I used that too. – qr7NmUTjF6vbA4n8V3J9 Nov 20 '21 at 19:24