How to get data from a webpage using Python

Question

Last year I wrote a Python script to store COVID-19 case data (active, cured and deaths) from the website. The script ran fine at first, but after the page was modified it returns only the header rows of the table and nothing else. I was originally using the pandas.read_html() method, but it can no longer grab all the data. I also tried the following, without success:

BeautifulSoup
lxml.html

I also tried the code as in here, but hit the same issue. What could be causing this, and what other steps could I take?

Here is what I have tried so far:

Using pandas

import pandas as pd

url = "https://www.mohfw.gov.in/"

df_list = pd.read_html(url)

Using lxml.html


>>> import requests
>>> page = requests.get(url)
>>> import lxml.html as lh
>>> doc = lh.fromstring(page.content)
>>> tbody_elements = doc.xpath('//tbody') # table is under the `<tbody>` tag, but this is not able to get the data
>>> tbody_elements
[] # empty list here
>>> tr_elements = doc.xpath('//tr')
>>> tr_elements
[<Element tr at 0x7fb3f507d260>, <Element tr at 0x7fb3f507d2b8>, <Element tr at 0x7fb3f507d310>]
>>> len(tr_elements)
3
>>> r = 1
>>> for i in tr_elements:
...     print("Row - ", r)
...     for row in i:
...         print(row.text_content())
...     r = r + 1
...

Output:

('Row - ', 1)

COVID-19 INDIA as on : 14 March 2021, 08:00 IST (GMT+5:30) [↑↓ Status change since yesterday]

('Row - ', 2)

S. No. Name of State / UT Active Cases* Cured/Discharged/Migrated* Deaths**

('Row - ', 3)

Total Change since yesterdayChange since yesterday Cumulative Change since yesterday Cumulative Change since yesterday

Using BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> url = 'https://www.mohfw.gov.in/'
>>> web_content = requests.get(url).content
>>> soup = BeautifulSoup(web_content, "html.parser")
>>> all_rows = soup.find_all('tr')
>>> all_rows
[<tr><h5>COVID-19 INDIA <span>as on : 15 March 2021, 08:00 IST (GMT+5:30)\t[\u2191\u2193 Status change since yesterday]</span></h5></tr>, <tr class="row1">\n<th rowspan="2" style="width:5%;"><strong>S. No.</strong></th>\n<th rowspan="2" style="width:24%;"><strong>Name of State / UT</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Active Cases*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Cured/Discharged/Migrated*</strong></th>\n<th colspan="2" style="text-align:center;width:24%;"><strong>Deaths**</strong></th>\n</tr>, <tr class="row2"><th style="width: 12%;">Total</th><th style="width: 12%;"><span class="mob-hide">Change since yesterday</span><span class="mob-show">Change since<br/> yesterday</span></th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th>\n<th style="width: 12%;">Cumulative</th><th style="width: 12%;">Change since yesterday</th></tr>]
>>> len(all_rows)
3

With both BeautifulSoup and lxml.html, I get only the first rows, which are actually the headers of the table.

Maybe show some of your code, and explain where the error occurs? — Xiddoc, Mar 15 '21 at 07:32

Martin Evans · Answer 1 · 2021-03-15T18:43:12.180

0

It looks like they've commented out the whole table. In my browser the table is not visible either.

You could use BeautifulSoup to find the comment entry and decode it as more soup, for example:

from bs4 import BeautifulSoup, Comment
import requests

url = 'https://www.mohfw.gov.in/'
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")
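trs = soup.find_all('tr')
# The data rows sit in an HTML comment that follows the last visible <tr>
comment = trs[-1].find_next(string=lambda text: isinstance(text, Comment))
# Parse the text of that comment as HTML in its own right
table_soup = BeautifulSoup(comment, "html.parser")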

for tr in table_soup.find_all('tr'):
    print([td.text for td in tr.find_all('td')])

This would give you output starting:

['1', 'Andaman and Nicobar Islands', '47', '133', '0']
['2', 'Andhra Pradesh', '18159', '19393', '492']
['3', 'Arunachal Pradesh', '387', '153', '3']
['4', 'Assam', '6818', '12888', '48']
['5', 'Bihar', '7549', '14018', '197']
['6', 'Chandigarh', '164', '476', '11']
['7', 'Chhattisgarh', '1260', '3451', '21']
['8', 'Dadra and Nagar Haveli and Daman and Diu', '179', '371', '2']
['9', 'Delhi', '17407', '97693', '3545']
['10', 'Goa', '1272', '1817', '19']
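The approach above relies on the comment being the first one after the last visible row. As a minimal sketch of a variant that does not depend on that position, the following scans every comment and parses only those that contain table rows. It runs on a small inline sample rather than the live page; `SAMPLE_HTML` and `rows_from_comments` are hypothetical names made up for this illustration, and the sample markup is only an assumption about the general shape of the page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the header row is live
# markup, while the data rows sit inside an HTML comment.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>S. No.</th><th>Name of State / UT</th><th>Active Cases</th></tr>
  </thead>
  <!--
  <tbody>
    <tr><td>1</td><td>Andaman and Nicobar Islands</td><td>47</td></tr>
    <tr><td>2</td><td>Andhra Pradesh</td><td>18159</td></tr>
  </tbody>
  -->
</table>
<!-- an unrelated comment -->
"""


def rows_from_comments(html):
    """Return the <td> text of every table row found inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "<tr" not in comment:
            continue  # skip comments that hold no table rows
        # Parse the comment text as HTML in its own right
        fragment = BeautifulSoup(comment, "html.parser")
        for tr in fragment.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # rows made only of <th> cells yield nothing
                rows.append(cells)
    return rows


rows = rows_from_comments(SAMPLE_HTML)
for row in rows:
    print(row)
```

To try this against the real site, pass `requests.get(url).text` instead of `SAMPLE_HTML`; whether the live page still has this layout is not something the sketch can guarantee.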

edited Mar 15 '21 at 18:43

answered Mar 15 '21 at 18:14

Martin Evans


Hey @Martin, actually NO! Initially I thought that too, but if you inspect you'll find the uncommented `` tag below the commented one. I have attached the same in the description as well. – Aroosh Rana Mar 15 '21 at 18:30
On my browser (firefox) the table is also not shown. The HTML clearly shows it commented out (the whole table, not just the tbody). Have you tried the code? – Martin Evans Mar 15 '21 at 19:15
Yes, I had fetched the comments earlier, but I guess that block was used for testing and is therefore commented out. What I am saying is that below the commented HTML code there's another section of `` that contains the uncommented data currently being shown. You can check this by hovering over the uncommented `` tag; it highlights the data. – Aroosh Rana Mar 16 '21 at 05:47
I suggest you `print(soup)` rather than look at your browser source. For me, the HTML still only shows one `` and that is inside the comment. The comment ends at the end of the table with `25602-->` – Martin Evans Mar 16 '21 at 09:32
Oh, got you. Yes, the soup is not reading the uncommented ``. If you check the values in the table, they differ from those in the comments. Do you know why this might be happening? – Aroosh Rana Mar 16 '21 at 11:09
beautifulsoup is correctly placing it inside a comment. If you `print(req.text)` you will also see `` The comment spans the whole table, it is not a single line comment. – Martin Evans Mar 16 '21 at 12:02
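The check discussed in these comments (inspecting the raw response rather than the browser's inspector) can be sketched with the standard library alone. The snippet counts `<tr>` tags in live markup versus inside HTML comments; `RowCounter`, `count_rows` and `SAMPLE` are hypothetical names for this illustration, and the sample string only assumes the general shape described in the thread.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags in live markup and collect comment bodies."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data)


def count_rows(html):
    """Return (live_rows, commented_rows) for an HTML string."""
    outer = RowCounter()
    outer.feed(html)
    outer.close()
    commented = 0
    for text in outer.comments:
        # Re-parse each comment body to count the rows hidden inside it
        inner = RowCounter()
        inner.feed(text)
        inner.close()
        commented += inner.rows
    return outer.rows, commented


# Hypothetical sample: one live header row, two commented-out data rows.
SAMPLE = (
    "<table><tr><th>State</th></tr>"
    "<!--<tbody><tr><td>Goa</td></tr><tr><td>Delhi</td></tr></tbody>-->"
    "</table>"
)
live, commented = count_rows(SAMPLE)
print(live, commented)
```

If the second number is much larger than the first for `requests.get(url).text`, the data rows are reaching the script only as comment text, which would explain why the parsers return headers alone.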
