
I am using lxml to parse an HTML file:

from lxml import html

tree = html.parse(myfile)
data = tree.xpath('//p/text()')

My HTML file contains 300 <p> elements, but len(data) is only 250 because some of them are empty (<p></p>). I want the empty ones to be included in data as well, either as 'nan' or ''.

Any suggestions on how to do this?

alecxe
user1566200

1 Answer


//p/text() selects the text nodes that are direct children of p elements, so a p element with no text of its own contributes nothing to the result.

Instead, select the p elements themselves and call .text_content() on each one; it returns an empty string for an empty element:

data = [p.text_content() for p in tree.xpath('//p')]

To demonstrate the difference:

>>> from lxml import html
>>> 
>>> 
>>> data = """
... <p>text1</p>
... <p></p>
... <p>text2</p>
... """
>>> 
>>> tree = html.fromstring(data)
>>> data = tree.xpath('//p/text()')
>>> len(data)
2
>>> 
>>> data = [p.text_content() for p in tree.xpath('//p')]
>>> len(data)
3
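
If you want 'nan' (or any other marker) instead of '' for the empty paragraphs, one way is to substitute it while building the list. This is only a minimal sketch: paragraph_texts and its placeholder parameter are hypothetical names made up for illustration, not part of lxml, and stripping whitespace before the emptiness check is an assumption you may not want.

```python
from lxml import html


def paragraph_texts(tree, placeholder=""):
    """Return one string per <p>, substituting `placeholder` for empty ones.

    Hypothetical helper for illustration; not part of lxml.
    """
    texts = []
    for p in tree.xpath("//p"):
        # text_content() gives '' for an empty element; strip() also treats
        # whitespace-only paragraphs as empty (an assumption of this sketch).
        text = p.text_content().strip()
        texts.append(text if text else placeholder)
    return texts


doc = html.fromstring("<div><p>text1</p><p></p><p>text2</p></div>")
print(paragraph_texts(doc))
print(paragraph_texts(doc, placeholder="nan"))
```

Because every p element produces exactly one entry, the list stays aligned with the paragraphs in the file, which is what the original //p/text() query could not guarantee.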
alecxe