
I'm writing a function that sends a request to a website, gets the response, and parses its content... but when I send a request to Persian sites it can't decode the content.

from urllib.request import urlopen

def gather_links(page_url):
    html_string = ''
    try:
        response = urlopen(page_url)
        if 'text/html' in response.getheader('Content-Type'):
            html_bytes = response.read()
            html_string = html_bytes.decode("utf-8")
    except Exception as e:
        print(str(e))

It shows this error, for example for https://www.entekhab.ir/ :

'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How can I change the code to decode these kinds of sites too?

snakecharmerb
ali frd

2 Answers


You should use requests instead of urllib.

import requests

response = requests.get('https://www.entekhab.ir/')
print(response.text)
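For reference, here is a sketch of the question's gather_links rewritten with requests. requests transparently decompresses gzip-encoded bodies and decodes the text using the charset declared in the Content-Type header, so no manual decoding is needed (the timeout value here is an arbitrary choice, not part of the original code):

```python
import requests

def gather_links(page_url):
    html_string = ''
    try:
        response = requests.get(page_url, timeout=10)
        # requests decompresses gzip/deflate bodies automatically
        if 'text/html' in response.headers.get('Content-Type', ''):
            # .text decodes the bytes using the declared charset
            html_string = response.text
    except requests.RequestException as e:
        print(str(e))
    return html_string
```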
askaroni

The problem is that the URL's content is compressed with gzip, which urlopen does not seem to handle by default:

>>> from urllib import request
>>> r = request.urlopen('https://www.entekhab.ir/')
>>> print(r.info())
Server: sepehr-proxy-1.2-rc3
Content-Type: text/html; charset=utf-8
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Content-Length: 81269
Date: Sat, 28 Sep 2019 14:41:49 GMT
Content-Encoding: gzip

Therefore, you need to decompress the response before decoding:

>>> import gzip
>>> bs = gzip.decompress(r.read())
>>> bs.decode('utf-8')[:113]
'<!-- 2019/09/28 18:16:08 --><!DOCTYPE html> <html lang="fa-IR" dir="rtl"> <head>           <meta charset="utf-8">'

As user askaroni's answer points out, the requests package handles this case automatically, and even the Python urllib docs recommend using it. Nevertheless, it's useful to understand why the response could not be decoded immediately.

snakecharmerb