python lxml.html: proper way to iterate through text with .tail in docstring order

Question

I'm trying to traverse an html string and concatenate the text content, with a string joiner that varies with the type of html tag encountered.

Example html: html_str='<td>This is how we parse our string together</td>'

I wrote a helper function called smart_itertext() to traverse an html element e via the method e.iter(). For each tag in e.iter(), it checks the tag and then appends the .text or .tail content.

My challenge is making the tail text show up in the right place. When I iterate by tag, I reach the element that the tail is attached to, and this appears to be my only chance to access the trailing text 'together'.

Desired result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how::we::parse::our string::together'

Actual result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how:: together::::we::parse::::our string'

This is my function:

def smart_itertext(tree, cross_joiner='::'):
    empty_join = ['strong', 'b', 'em', 'i', 'small', 'marked', 'deleted',
                  'ins', 'sub', 'sup']
    cross_join = ['td', 'tr', 'br', 'p']
    output = ''
    for element in tree.iter():
        if element.tag in empty_join:
            if element.text:
                output += element.text
            if element.tail:
                output += element.tail
        elif element.tag in cross_join:
            if element.text:
                output += cross_joiner + element.text
            else:
                output += cross_joiner
            if element.tail:
                output += cross_joiner + element.tail
        else:
            print('unknown tag in smart_itertext:', element.tag)
    return output

What's the proper way to accomplish this?

score 0 · Accepted Answer · answered Nov 23 '15 at 16:55

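The answer is to use XPath. A text() query returns every text node in document order, and each returned string has the attributes is_text and is_tail plus the method getparent(), so you can tell which element it belongs to and whether it is that element's text or its tail.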

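From the lxml.etree tutorial (there, html is the small document <html><body>TEXT<br/>TAIL</body></html> and build_text_list is etree.XPath("//text()")):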

Note that a string result returned by XPath is a special 'smart' object that knows about its origins. You can ask it where it came from through its getparent() method, just as you would with Elements:

>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body

>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br

You can also find out if it's normal text content or tail text:

>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
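
Applied to the question, this could look roughly like the sketch below. It is only an illustration under a few assumptions: the name smart_itertext_xpath is made up here, the sample markup is hypothetical (the inner tags of the question's html_str did not survive), and any tag outside the cross_join list is treated as an empty join. For each text node, the joiner is chosen from the element the node is attached to: for .text that is the element just opened, for .tail the element just closed.

```python
import lxml.html

# Tags that should produce the cross joiner, copied from the question.
# Assumption: every other tag is joined with the empty string.
CROSS_JOIN = {'td', 'tr', 'br', 'p'}


def smart_itertext_xpath(tree, cross_joiner='::'):
    """Concatenate text nodes in document order (hypothetical helper)."""
    pieces = []
    # './/text()' yields .text and .tail strings in document order.
    for text in tree.xpath('.//text()'):
        # For is_text this is the enclosing element; for is_tail it is
        # the element that the tail follows.
        owner = text.getparent()
        joiner = cross_joiner if owner.tag in CROSS_JOIN else ''
        if pieces:  # no joiner in front of the very first piece
            pieces.append(joiner)
        pieces.append(str(text))
    return ''.join(pieces)


# Hypothetical sample markup, not the question's original string.
html_str = ('<p>This is how<br>we<br>parse<br>our <strong>string</strong>'
            '<br>together</p>')
result = smart_itertext_xpath(lxml.html.fromstring(html_str))
print(result)  # This is how::we::parse::our string::together
```

One limitation of driving the loop from text nodes: a boundary with no text next to it (for example two consecutive br tags) contributes no joiner of its own. If those boundaries matter, you would have to walk the elements as well.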

Can you expand on "use xpath"? How exactly did you produce the desired result stated in the question? — mzjn, Nov 23 '15 at 18:29
