0

I am trying to extract nested table from the url 'http://gsa.nic.in/report/janDhan.html' using pandas with code:

import pandas as pd
url ="http://gsa.nic.in/report/janDhan.html"
table=pd.read_html(url)[3]
print(table)
table.to_excel("GSA.xlsx")

However it is printing only header of the table. Please guide.I am a newbie and don't want to use beautifulsoup. If pandas can't do the intended task then why?

Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • I'm afraid I could only help with BeautifulSoup. However, there is an export to excel button on that page. Why not use that and read the excel file into Pandas? – NomadMonad Apr 17 '20 at 16:47
  • Thanks Luca. Could you please share code on how to do this with BeautifulSoup. I am just trying to learn webscraping that's why i am not using that excel option. – Dhirendra Sinha Apr 17 '20 at 16:53

2 Answers2

1

The table is being populated by javascript, so it is not in the HTML that pandas is fetching. You can confirm this by viewing the source of the page in your browser and searching for values that are in the table, such as "PRADESH."

The solution is to use a library such as requests-html or selenium to scrape the javascript-rendered page. Then you can parse that HTML with pandas.

from requests_html import HTMLSession

s = HTMLSession()
r = s.get(url)
r.html.render()

table = pd.read_html(r.html)[3]
Eric Truett
  • 2,970
  • 1
  • 16
  • 21
0

So as Eric pointed out the table is being populated by JavaScript.

However, is quite easy to intercept the API call the page is doing internally by using Chrome's developer tools.

Go to network tab and filter by XHR and you will find the endpoint the page is making calls to, which is

http://gsa.nic.in/gsaservice/services/service.svc/gsastatereport?schemecode=PMJDY

enter image description here Then a simple script like this will get you the data nicely formatted

import json
import pandas as pd
import requests


r = requests.get('http://gsa.nic.in/gsaservice/services/service.svc/gsastatereport?schemecode=PMJDY')
data = json.loads(r.json()['d'])
pd.DataFrame(data[0]['data'])

LGDStateCode    StateName   totalSaturatedVillage   villageSaturatedTillDate    TotalBeneficiaries  TotalBeneficiariesRegisteredTillDate    Saturation
0   28  ANDHRA PRADESH  305 305 27238   27238   100.00
1   12  ARUNACHAL PRADESH   299 283 42331   39999   94.49
2   18  ASSAM   3042    2375    648815  621878  95.85
3   10  BIHAR   635 544 92356   90131   97.5



NomadMonad
  • 651
  • 6
  • 12
  • Thanks Luca. Perhaps i need to learn html & css before jumping to web scraping. Will just run your solution once the first one gets over. Eric's code running for last 15 minutes:) – Dhirendra Sinha Apr 17 '20 at 17:27
  • No problem. There's no HTML or CSS involved in this solution though. You just get the raw data. – NomadMonad Apr 17 '20 at 17:29
  • Yes. I know the code suggested by you above will absolutely solve my problem but the point remains i am not able to think on lines as suggested by you since i am not that conversant with web page designs. – Dhirendra Sinha Apr 17 '20 at 17:37