Scraping a nested and unstructured table in Python (lxml)

Question

Scraping this website with lxml works fine for everything except one table, in which the tr, td and heading th elements are nested and mixed together, forming an unstructured HTML table.

<table class='table'>
    <tr>
        <th>Serial No.
            <th>Full Name
                <tr>
                    <td>1
                        <td rowspan='1'> John 
                            <tr>
                                <td>2
                                    <td rowspan='1'>Jane Alleman
                                        <tr>
                                            <td>3
                                                <td rowspan='1'>Mukul Jha
                                                 .....
                                                 .....
                                                 .....
</table>

I tried the following XPath expressions, but each of them just returns an empty list.

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]

persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s

Finally, what is the reason for such nesting? Is it meant to prevent scraping?

maybe it is only mistake - someone forgot closing tags in code. — furas, Sep 10 '19 at 04:19
@furas it's just a direct copy of the actual table returned by `page.text` for the site; the structure is exactly the same. And second, what is the need for such a design? I think it is done to prevent scrapping (maybe). — Mukul Kumar Jha, Sep 10 '19 at 04:45
See the last and second last table design in https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station — Mukul Kumar Jha, Sep 10 '19 at 04:46
The last table your linked page seems pretty simple; can you post one full table from your site? — Jack Fleeting, Sep 10 '19 at 11:15
Yes, for example, you could use pandas to read that table, if pandas is available to you. — Jack Fleeting, Sep 10 '19 at 16:07
Can you please help me exactly how? Because the table is still broken (nested) badly. — Mukul Kumar Jha, Sep 10 '19 at 16:08
you can't do it only using `xpath`. You will have to create own parser - or rather create individual method for every row. — furas, Sep 10 '19 at 16:15
Why `xpath` is not enough ? Is it because of complicated nesting of table? @furas JackFleeting — Mukul Kumar Jha, Sep 10 '19 at 16:17
Because the table is not constructed correctly. Some elements have an opening tag without a matching closing tag, and some columns do not even have an opening tag and use another tag instead. So this table is one big mess. There is no single rule that describes how it is built; every row is messed up in a different way. — furas, Sep 10 '19 at 16:22
For testing I used BeautifulSoup, which shows the HTML in its original form with all the mess. Now I use `lxml` and `lxml.html.tostring(...)` to view the table as HTML after it has been loaded into memory. It seems lxml handles the tags much as a browser does and builds a correct table in memory, so a normal XPath such as `/tr/td//text()` is enough to get the text from all cells. — furas, Sep 10 '19 at 16:35
FYI it’s __scraping__ (and __scraper__, __scrape__, __scraped__) not scrapping. ‘Scrapping’ means throwing away like rubbish. — DisappointedByUnaccountableMod, May 25 '21 at 15:33

furas · Accepted Answer · 2019-09-10T16:58:52.977

It seems lxml loads the table much as a browser does: it builds a correct structure in memory, and you can see the corrected HTML with lxml.html.tostring(table).

So the table is correctly formatted once parsed, and a normal './tr/td//text()' is enough to get all the values.

import requests
import lxml.html

text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text

s = lxml.html.fromstring(text)

table = s.xpath('//table')[1]

for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)

print(lxml.html.tostring(table, pretty_print=True).decode())

Result

['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']

<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td><a href="tel:8800793196">8800793196</a></td>
</tr>
</table>

Original HTML for comparison (some closing tags are missing):

<table class='table'>
<tr><td  title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
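
The same behaviour can be checked offline against markup shaped like the table in the question. The sketch below is only an illustration: `BROKEN_HTML` is a copy of the snippet from the question rather than the live page, and `extract_names` is a hypothetical helper name. It assumes lxml's parser closes an open th/td/tr when the next one starts, as it does for the table above, so the cells end up as siblings inside each row. That would also explain why paths like `/tr/th/th/tr/td/td` match nothing: the nesting exists only in the indentation of the source, not in the parsed tree.

```python
import lxml.html

# Copy of the question's markup: no closing </th>, </td> or </tr> tags.
BROKEN_HTML = """
<table class='table'>
    <tr>
        <th>Serial No.
            <th>Full Name
                <tr>
                    <td>1
                        <td rowspan='1'> John
                            <tr>
                                <td>2
                                    <td rowspan='1'>Jane Alleman
                                        <tr>
                                            <td>3
                                                <td rowspan='1'>Mukul Jha
</table>
"""


def extract_names(html):
    """Return the stripped text of the second <td> in every data row."""
    tree = lxml.html.fromstring(html)
    names = []
    for row in tree.xpath('//table[@class="table"]//tr'):
        cells = row.xpath('./td')
        # The header row holds only <th> cells, so it is skipped here.
        if len(cells) >= 2:
            names.append(cells[1].text_content().strip())
    return names


names = extract_names(BROKEN_HTML)
print(names)
```

Selecting the cell by position instead of filtering with `isdigit()` keeps names that happen to contain digits, and `strip()` removes the whitespace left over from the source indentation.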

I added `s = lxml.html.fromstring(text)`. Originally I had lines with `BeautifulSoup` in the code, which I cut out. It seems I removed too many lines :) — furas, Sep 10 '19 at 16:59

score 1 · Answer 2 · answered Sep 10 '19 at 17:20

Similar to furas' answer, but using pandas to scrape the last table on the page:

import requests
import lxml.html  # plain "import lxml" does not load the lxml.html submodule
import pandas as pd

url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)

root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)

columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1

Output:

   Gate Dwarka Sector 14 Metro Station
0   1   Eros Etro Mall
1   2   Nirmal Bharatiya Public School
