
I'm trying to scrape this NREGA website, which contains data in Hindi (Devanagari script). The structure is easy enough to scrape, but when I use requests/urllib to fetch the HTML, the Hindi text comes back as gibberish. The text displays fine in the site's page source, though.

content = requests.get(URL).text

'1 पी एस' on the site ends up in content as '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8' and is displayed as gibberish when I try to export to a CSV.
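The mangling can be reproduced without the site at all; a minimal sketch, using the sample string from above:

```python
# UTF-8 bytes for the Hindi text, mis-decoded as Latin-1 — this is what
# happens when the decoder assumes the wrong charset.
raw = '1 पी एस'.encode('utf-8')
mojibake = raw.decode('latin-1')      # what ends up in `content`
print(repr(mojibake))                 # gibberish, one character per UTF-8 byte

# Reversing the bad decode recovers the original text.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)                          # 1 पी एस
```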

  • You can download the Excel file provided on the website — basically, you can automate downloading the Excel file, which contains all the data, instead of scraping and saving the data on your own. – Vin Sep 21 '20 at 04:03
  • @Vin I need to scrape around 200k such queries. – Nik Hil Sep 21 '20 at 04:49
  • Yeah, that's also not a problem. Once you create an automation script with the dynamic parameters for the data you want, it will go and download that Excel file for you. In your case I'm guessing the dynamic values will be STATE, DISTRICT and BLOCK? – Vin Sep 21 '20 at 04:59
  • Can you please tell me how you are navigating to the Rajasthan state page? – Vin Sep 21 '20 at 05:00
  • @Vin I used selenium on the parent website to extract all the URLs I want to scrape. – Nik Hil Sep 22 '20 at 04:02
  • Then using selenium you can click the Excel data link to download the data for each state, which is much faster than parsing the whole table for each site — and in your case there are hundreds of them. – Vin Sep 22 '20 at 04:06
  • I need other stuff done that is better handled by 'requests'. Thanks for your tips. – Nik Hil Sep 22 '20 at 04:15

1 Answer


The response from the server doesn't specify a charset in its Content-Type header, so requests assumes that the page is encoded as ISO-8859-1 (latin-1).

>>> r = requests.get('https://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/funddisreport_2701004_eng_1314_.html')
>>> r.encoding
'ISO-8859-1'

In fact, the page is encoded as UTF-8, as we can tell by inspecting the response's apparent_encoding attribute:

>>> r.apparent_encoding
'utf-8'

or by experiment:

>>> s = '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8'
>>> s.encode('latin').decode('utf-8')
'1 पी एस'
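Rather than re-encoding the mangled text after the fact, you can also override the guessed charset on the response before reading .text, and requests will use it for decoding. A sketch — fetch_utf8 is a hypothetical helper name, not part of any API:

```python
import requests

def fetch_utf8(url: str) -> str:
    """GET a page whose Content-Type header omits the charset, forcing UTF-8."""
    r = requests.get(url)
    r.encoding = 'utf-8'   # replace the ISO-8859-1 fallback before touching .text
    return r.text
```

Calling fetch_utf8 with the report URL above then returns the page with the Devanagari text intact.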

The correct output can be obtained by decoding the response's content attribute:

>>> html = r.content.decode(r.apparent_encoding)
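Since the end goal was a CSV export, write the file with an explicit encoding as well; a sketch where the filename and row are placeholders ('utf-8-sig' prepends a BOM so Excel detects the encoding when opening the file):

```python
import csv

rows = [['1', 'पी एस']]   # stand-in for rows parsed out of the decoded html
with open('out.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)
```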