Last year I wrote a Python script to store COVID-19 case data (active, cured and deaths) from the website https://www.mohfw.gov.in/. The script ran fine at first, but after some changes to the page I now get only the first 2 rows, which are the headers, and nothing else. I was originally using the pandas.read_html() method, but it can no longer grab all the data. I also tried the following, but they did not help either:
- BeautifulSoup
- lxml.html
I also tried the code as in here, but I still see the same issue. Why might this be happening, and what other steps could I take?
Here is what I have tried so far:
- Using pandas
import pandas as pd
url = "https://www.mohfw.gov.in/"
df_list = pd.read_html(url)
- Using lxml.html
>>> import requests
>>> page = requests.get(url)
>>> import lxml.html as lh
>>> doc = lh.fromstring(page.content)
>>> tbody_elements = doc.xpath('//tbody') # the table data is under the `<tbody>` tag, but it is not found
>>> tbody_elements
[] # empty list here
>>> tr_elements = doc.xpath('//tr')
>>> tr_elements
[<Element tr at 0x7fb3f507d260>, <Element tr at 0x7fb3f507d2b8>, <Element tr at 0x7fb3f507d310>]
>>> len(tr_elements)
3
>>> r = 1
>>> for i in tr_elements:
...     print("Row - ", r)
...     for row in i:
...         print(row.text_content())
...     r = r + 1
...
Output:
('Row - ', 1)
COVID-19 INDIA as on : 14 March 2021, 08:00 IST (GMT+5:30) [↑↓ Status change since yesterday]
('Row - ', 2)
S. No. Name of State / UT Active Cases* Cured/Discharged/Migrated* Deaths**
('Row - ', 3)
Total Change since yesterdayChange since yesterday Cumulative Change since yesterday Cumulative Change since yesterday
- Using BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> url = 'https://www.mohfw.gov.in/'
>>> web_content = requests.get(url).content
>>> soup = BeautifulSoup(web_content, "html.parser")
>>> all_rows = soup.find_all('tr')
>>> all_rows
[<tr><h5>COVID-19 INDIA <span>as on : 15 March 2021, 08:00 IST (GMT+5:30)\t[\u2191\u2193 Status change since yesterday]</span></h5></tr>, <tr class="row1">\n<th rowspan="2" style="width:5%;"><strong>S. No.</strong></th>\n<th rowspan="2" style="width:24%;"><strong>Name of State / UT</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Active Cases*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Cured/Discharged/Migrated*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Deaths**</strong></th>\n</tr>, <tr class="row2"><th style="width: 12%;">Total</th><th style="width: 12%;"><span class="mob-hide">Change since yesterday</span><span class="mob-show">Change since<br/> yesterday</span></th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th></tr>]
>>> len(all_rows)
3
With both BeautifulSoup and lxml.html, I only get the first rows of the table, which are actually its headers, and none of the data rows.
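One check that might narrow this down is to count the tags in the raw response text itself, independent of pandas, lxml, or BeautifulSoup. The sketch below uses only the standard library. `TagCounter`, `count_tags`, and `sample_html` are hypothetical names introduced for illustration, and the sample string is just a stand-in for `requests.get(url).text`, not the real page. If the `td` count on the real response is also zero, that would suggest the data rows are not in the HTML that `requests` receives, but I have not confirmed that this is what the site does.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every opening tag seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical stand-in for requests.get(url).text:
# a table with header rows only and no data cells.
sample_html = """
<table>
  <tr><th>S. No.</th><th>Name of State / UT</th></tr>
  <tr><th>Total</th><th>Change since yesterday</th></tr>
</table>
"""

counts = count_tags(sample_html)
print("tr:", counts["tr"], "th:", counts["th"], "td:", counts["td"])
```

To try it on the live page, `sample_html` would be replaced with `requests.get(url).text`; a non-zero `tr` count together with a zero `td` count would match what the three parsers above report.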