
I'm writing a function that sends a request to a website, gets the response, and parses its content... but when I send a request to Persian sites it can't decode the content.

from urllib.request import urlopen

def gather_links(page_url):
    html_string = ''
    try:
        response = urlopen(page_url)
        if 'text/html' in response.getheader('Content-Type'):
            html_bytes = response.read()
            html_string = html_bytes.decode("utf-8")
    except Exception as e:
        print(str(e))

It shows this error, for example for https://www.entekhab.ir/ :

'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How can I change the code to decode these kinds of sites too?

snakecharmerb
ali frd

2 Answers


You should use requests instead of urllib.

import requests

response = requests.get('https://www.entekhab.ir/')
print(response.text)
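For reference, here is a sketch of the question's gather_links rewritten with requests. requests transparently decompresses gzip-encoded bodies and decodes the text using the charset declared in the Content-Type header, so no manual decoding is needed (the timeout value here is an arbitrary choice, not part of the original code):

```python
import requests

def gather_links(page_url):
    html_string = ''
    try:
        response = requests.get(page_url, timeout=10)
        # requests decompresses gzip/deflate bodies automatically
        if 'text/html' in response.headers.get('Content-Type', ''):
            # .text decodes the bytes using the declared charset
            html_string = response.text
    except requests.RequestException as e:
        print(str(e))
    return html_string
```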
askaroni

The problem is that the URL's content is compressed with gzip, which urlopen does not seem to handle by default:

>>> from urllib import request
>>> r = request.urlopen('https://www.entekhab.ir/')
>>> print(r.info())
Server: sepehr-proxy-1.2-rc3
Content-Type: text/html; charset=utf-8
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Content-Length: 81269
Date: Sat, 28 Sep 2019 14:41:49 GMT
Content-Encoding: gzip

Therefore, you need to decompress the response before decoding:

>>> import gzip
>>> bs = gzip.decompress(r.read())
>>> bs.decode('utf-8')[:113]
'<!-- 2019/09/28 18:16:08 --><!DOCTYPE html> <html lang="fa-IR" dir="rtl"> <head>           <meta charset="utf-8">'

As user askaroni's answer points out, the requests package handles this case automatically, and even the Python urllib docs recommend using it. Nevertheless, it's useful to understand why the response could not be decoded immediately.

snakecharmerb