I am trying to extract all data from a table. Table cells are written as <td headers="h1" align="left"></td> in the source when there is no data.

Serializing with the etree.tostring() method from the lxml library prints these elements as <td headers="h1" align="left"/> instead of keeping the source formatting.

Furthermore, if I run the XPath query tree.xpath('//td[@headers="h1"]/text()'), the resulting list does not include blank values for the cells that have no data.

As I am trying to write these results to a CSV file, how do I include NULL, i.e. "" when there is no data?

toolshed

1 Answer

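The text() step only returns text nodes that exist, so empty cells contribute nothing to the result. One workaround is to select the elements themselves with the XPath //td[@headers="h1"] and then read the .text attribute of each: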

from lxml import etree

data = """
<table>
    <tr>
        <td headers="h1" align="left"></td>
        <td headers="h1" align="left">Text1</td>
        <td headers="h1" align="left"/>
        <td headers="h1" align="left">Text2</td>
        <td headers="h1" align="left"></td>
    </tr>
</table>
"""

tree = etree.fromstring(data)
# .text is None for an element that has no text content
print([element.text for element in tree.xpath('//td[@headers="h1"]')])

Prints:

[None, 'Text1', None, 'Text2', None]

If you want an empty string instead of None:

print([element.text if element.text is not None else ''
       for element in tree.xpath('//td[@headers="h1"]')])

would print:

['', 'Text1', '', 'Text2', '']
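
To connect this back to the CSV part of the question, here is a minimal sketch using the standard csv module. The two-column layout, the header names row and h1, and the in-memory io.StringIO buffer are assumptions made for illustration; a real script would more likely write to a file opened with newline=''.

```python
import csv
import io

from lxml import etree

data = """
<table>
    <tr>
        <td headers="h1" align="left"></td>
        <td headers="h1" align="left">Text1</td>
        <td headers="h1" align="left"/>
        <td headers="h1" align="left">Text2</td>
        <td headers="h1" align="left"></td>
    </tr>
</table>
"""

tree = etree.fromstring(data)

# Select the elements, not their text nodes, so empty cells are kept.
cells = [td.text or "" for td in tree.xpath('//td[@headers="h1"]')]

# Hypothetical layout: one CSV row per cell, with a running index.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["row", "h1"])
for index, value in enumerate(cells):
    writer.writerow([index, value])

csv_text = buffer.getvalue()
print(csv_text)
```

Cells with no data come out as empty fields. If the file should contain a literal "" for them, passing quoting=csv.QUOTE_ALL to csv.writer quotes every field, including the empty ones.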

Also see: How do I return '' for an empty node's text() in XPath?
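
On the serialization part of the question: the self-closing form is how lxml's default XML serializer writes an element that has no text and no children, and both spellings are equivalent XML. If the explicit closing tag matters, one option is to pass method="html" to etree.tostring(). A small sketch, with illustrative variable names:

```python
from lxml import etree

# A single empty cell, parsed with the XML parser as in the answer above.
cell = etree.fromstring('<td headers="h1" align="left"></td>')

# Default XML serialization collapses the empty element.
xml_form = etree.tostring(cell, encoding="unicode")

# The HTML serializer writes an explicit closing tag for non-void elements.
html_form = etree.tostring(cell, encoding="unicode", method="html")

print(xml_form)
print(html_form)
```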

alecxe