I'm trying to read HTML content and extract the last table's contents into a list using lxml.

Here is my last table:

<table border="1">
        <thead>
            <tr>
                <td><p>T1</p></td>
                <td><p>T2</p></td>
                <td><p>T3</p></td>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td><p>A1</p></td>
                <td><p></p></td>
                <td><p>A3</p></td>
            </tr>
        </tbody>
    </table>

When I run the code below, eol_table is ['T1', 'T2', 'T3', 'A1', 'A3']. It does not include a None or blank value where the <p> content is empty.

The expected value is ['T1', 'T2', 'T3', 'A1', '', 'A3']. How can I get this result?

Code:

eol_html_content =  urlfetch.fetch("https://dl.dropboxusercontent.com/u/7384181/Test.html").content

import lxml.html as LH
html_root = LH.fromstring(eol_html_content)

eol_table = None
for tbl in html_root.xpath('//table'):
   eol_table = tbl.xpath('.//tr/td/p/text()')

self.response.out.write(eol_table)
Nijin Narayanan

1 Answer

The root of your problem is that text() in your XPath selects text nodes, and an empty p element has no text node, so nothing is returned for it.
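To make the cause visible, here is a minimal sketch comparing the two queries on one row. The fragment and the variable names are made up for illustration; only the lxml calls already used in this thread are assumed.

```python
import lxml.html as LH

# Hypothetical one-row fragment; the middle <p> is empty.
table = LH.fromstring(
    "<table><tr>"
    "<td><p>A1</p></td><td><p></p></td><td><p>A3</p></td>"
    "</tr></table>"
)

# text() returns only the text nodes that actually exist.
text_nodes = table.xpath('.//td/p/text()')

# Selecting the elements returns every <p>, empty or not.
p_elements = table.xpath('.//td/p')

print(len(text_nodes), len(p_elements))
```

The first query yields two items and the second yields three, which is why a column goes missing from the result.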

The solution is to modify the XPath to select all p elements and then get the text from each one.

import lxml.html as LH

xmlstr = """
<table border="1">
    <thead>
        <tr>
            <td><p>T1</p></td>
            <td><p>T2</p></td>
            <td><p>T3</p></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><p>A1</p></td>
            <td><p></p></td>
            <td><p>A3</p></td>
        </tr>
    </tbody>
</table>
"""

html_root = LH.fromstring(xmlstr)

eol_table = None
for tbl in html_root.xpath('//table'):
    p_elements = tbl.xpath('.//tr/td/p')
    eol_table = [p_elm.text for p_elm in p_elements]

    print(eol_table)

This prints:

['T1', 'T2', 'T3', 'A1', None, 'A3']
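If you want '' rather than None, as in the expected output from the question, one possible variation is to fall back to an empty string when .text is None. This is a sketch on a shortened, made-up fragment, not the only way to do it.

```python
import lxml.html as LH

# Shortened, hypothetical version of the table from the question.
html = """<table>
<tr><td><p>A1</p></td><td><p></p></td><td><p>A3</p></td></tr>
</table>"""

root = LH.fromstring(html)

# .text is None for an empty <p>; "or ''" turns that into an empty string.
cells = [p.text or '' for p in root.xpath('.//tr/td/p')]

print(cells)
```

With this fragment, cells comes out as ['A1', '', 'A3'].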

Alternative for the case where some <td> element has no <p>

(Nijo asked for this in a follow-up comment and also suggested the text_content() call.)

xmlstr = """
<table border="1">
    <thead>
        <tr>
            <td><p>T1</p></td>
            <td><p>T2</p></td>
            <td><p>T3</p></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><p>A1</p></td>
            <td><p></p></td>
            <td></td>
        </tr>
    </tbody>
</table>
"""
html_root = LH.fromstring(xmlstr)

eol_table = None
for tbl in html_root.xpath('//table'):
    td_elements = tbl.xpath('.//tr/td')
    eol_table = [td_elm.text_content() for td_elm in td_elements]
    print(eol_table)

This prints:

['T1', 'T2', 'T3', 'A1', '', '']

As you can see, text_content() never returns None; where .text would be None, it returns the empty string ''.
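One more difference is worth knowing when choosing between the two: .text holds only the text before the first child element, while text_content() also collects text from nested elements. A small sketch on a made-up fragment:

```python
import lxml.html as LH

# Hypothetical fragment: one <p> with nested markup, one empty <p>.
div = LH.fromstring("<div><p>A<b>1</b></p><p></p></div>")
nested, empty = div.xpath('./p')

nested_text = nested.text               # only the text before <b>: 'A'
nested_all = nested.text_content()      # text of <p> and its children: 'A1'
empty_text = empty.text                 # None for the empty <p>
empty_all = empty.text_content()        # '' for the empty <p>

print(nested_text, nested_all, empty_text, repr(str(empty_all)))
```

So text_content() is the safer choice when cells may contain inline tags such as <b> or <a>.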

Jan Vlcinsky
  • If I have a `<td>` with no `<p>` tag, how do I add `None` to the list for that column?

    – Nijin Narayanan May 23 '14 at 07:14
  • @Nijo - to work with `<td>` elements that have no `<p>` element, change `p_elements = tbl.xpath(".//tr/td/p")` to `td_elements = tbl.xpath(".//tr/td")`. Then loop over the found `td` elements: if there is no `p` element in it, return `None`; if there is a `p`, return its text. As this makes the loop over `td` a bit longer, I would not use a list comprehension but a regular `for` loop over the found `<td>` elements. Find your own way to get from `<td>` to `<p>` (or ask another question).

    – Jan Vlcinsky May 23 '14 at 10:02
  • `p_elements = tbl.xpath('.//tr/td')` `eol_table = [p_elm.text_content() for p_elm in p_elements]` This helped me to solve the issue. – Nijin Narayanan May 23 '14 at 10:28
  • @Nijo Thanks for `text_content`, I had the feeling there is something like that, but did not have time to complete my check. I added new part to my answer showing this. – Jan Vlcinsky May 23 '14 at 14:32
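The explicit loop described in the comments above could be sketched as follows. The fragment is made up, and mapping an empty `<p>` to '' while a missing `<p>` becomes `None` is an assumption chosen here so the two cases stay distinguishable; adjust it to your needs.

```python
import lxml.html as LH

# Hypothetical row: a filled <p>, an empty <p>, and a <td> with no <p>.
html = """<table>
<tr><td><p>A1</p></td><td><p></p></td><td></td></tr>
</table>"""

root = LH.fromstring(html)

eol_table = []
for td in root.xpath('.//tr/td'):
    p = td.find('p')  # first direct <p> child, or None if there is none
    if p is None:
        eol_table.append(None)          # no <p> in this cell
    else:
        eol_table.append(p.text or '')  # empty <p> becomes ''

print(eol_table)
```

For this fragment the list is ['A1', '', None]. If you do not need to tell the two cases apart, the text_content() list comprehension from the answer is shorter.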