
I'm trying to scrape this NREGA website, which contains data in Hindi (Devanagari script). The structure is easy enough to scrape, but when I use requests/urllib to fetch the HTML, the Hindi text comes back as gibberish. The text displays fine in the site's page source, though.

content = requests.get(URL).text

'1 पी एस' on the site ends up in content as '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8' and is displayed as gibberish when I try to export to a CSV.
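The mangling can be reproduced without the site at all; a minimal sketch, using the sample string from above:

```python
# UTF-8 bytes for the Hindi text, mis-decoded as Latin-1 — this is what
# happens when the decoder assumes the wrong charset.
raw = '1 पी एस'.encode('utf-8')
mojibake = raw.decode('latin-1')      # what ends up in `content`
print(repr(mojibake))                 # gibberish, one character per UTF-8 byte

# Reversing the bad decode recovers the original text.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)                          # 1 पी एस
```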

  • You can download the Excel file provided on the website — basically, you can automate downloading the Excel file, which contains all the data, instead of scraping and saving the data on your own. – Vin Sep 21 '20 at 04:03
  • @Vin I need to scrape around 200k such queries. – Nik Hil Sep 21 '20 at 04:49
  • Yeah, that's also not a problem. Once you create an automation script with the dynamic parameters for the data you want, it will go and download that Excel file for you. In your case I'm guessing the dynamic values will be STATE, DISTRICT and BLOCK? – Vin Sep 21 '20 at 04:59
  • Can you please tell me how you are navigating to the Rajasthan state page? – Vin Sep 21 '20 at 05:00
  • @Vin I used selenium on the parent website to extract all the URLs I want to scrape. – Nik Hil Sep 22 '20 at 04:02
  • Then using selenium you can click the Excel data link to download the data for each state, which is much faster than parsing the whole table for each site — and in your case there are hundreds of them. – Vin Sep 22 '20 at 04:06
  • I need other stuff done that is better handled by 'requests'. Thanks for your tips. – Nik Hil Sep 22 '20 at 04:15

1 Answer


The response from the server doesn't specify a charset in its Content-Type header, so requests assumes that the page is encoded as ISO-8859-1 (latin-1).

>>> r = requests.get('https://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/funddisreport_2701004_eng_1314_.html')
>>> r.encoding
'ISO-8859-1'

In fact, the page is encoded as UTF-8, as we can tell by inspecting the response's apparent_encoding attribute:

>>> r.apparent_encoding
'utf-8'

or by experiment:

>>> s = '1 \xe0\xa4\xaa\xe0\xa5\x80 \xe0\xa4\x8f\xe0\xa4\xb8'
>>> s.encode('latin').decode('utf-8')
'1 पी एस'
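Rather than re-encoding the mangled text after the fact, you can also override the guessed charset on the response before reading .text, and requests will use it for decoding. A sketch — fetch_utf8 is a hypothetical helper name, not part of any API:

```python
import requests

def fetch_utf8(url: str) -> str:
    """GET a page whose Content-Type header omits the charset, forcing UTF-8."""
    r = requests.get(url)
    r.encoding = 'utf-8'   # replace the ISO-8859-1 fallback before touching .text
    return r.text
```

Calling fetch_utf8 with the report URL above then returns the page with the Devanagari text intact.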

The correct output can be obtained by decoding the response's content attribute:

>>> html = r.content.decode(r.apparent_encoding)
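Since the end goal was a CSV export, write the file with an explicit encoding as well; a sketch where the filename and row are placeholders ('utf-8-sig' prepends a BOM so Excel detects the encoding when opening the file):

```python
import csv

rows = [['1', 'पी एस']]   # stand-in for rows parsed out of the decoded html
with open('out.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)
```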