I'm trying to traverse an HTML string and concatenate its text content, using a joiner string that depends on the type of HTML tag encountered.
Example html:
html_str='<td><p>This is how<br>we<br><strong>par<sup>s</sup>e</strong><br>our string</p> together</td>'
I wrote a helper function called smart_itertext() to traverse an HTML element e via the method e.iter(). For each tag in e.iter(), it checks the tag and then appends the .text or .tail content.
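My challenge is making the tail text show up in the right place. When I iterate by tag, I reach <p> before any of its children, and this appears to be my only chance to access the trailing text ' together' (it is stored as the .tail of <p>), even though that text belongs after everything inside <p>.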
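To make the ordering problem concrete, here is a minimal sketch that lists each element's .text and .tail in the order iter() visits them. It is an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (both expose the same .tag / .text / .tail model), the fragment is rewritten as well-formed XML with <br/>, and the names xml_str and dump_text_and_tail are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same content as html_str, but with self-closing <br/> so that the
# standard-library XML parser accepts it (an assumption for this demo).
xml_str = ('<td><p>This is how<br/>we<br/><strong>par<sup>s</sup>e</strong>'
           '<br/>our string</p> together</td>')


def dump_text_and_tail(root):
    """Return (tag, text, tail) for every element, in iter() order."""
    return [(el.tag, el.text, el.tail) for el in root.iter()]


rows = dump_text_and_tail(ET.fromstring(xml_str))
for row in rows:
    print(row)
```

The row for <p> is ('p', 'This is how', ' together') and it comes second, right after <td>. So a flat loop over iter() sees ' together' before it has seen 'we', 'par', or any other text nested inside <p>.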
Desired result:
>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how::we::parse::our string::together'
Actual result:
>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how:: together::::we::parse::::our string'
This is my function:
def smart_itertext(tree, cross_joiner='::'):
    empty_join = ['strong', 'b', 'em', 'i', 'small', 'marked', 'deleted',
                  'ins', 'sub', 'sup']
    cross_join = ['td', 'tr', 'br', 'p']
    output = ''
    for element in tree.iter():
        if element.tag in empty_join:
            if element.text:
                output += element.text
            if element.tail:
                output += element.tail
        elif element.tag in cross_join:
            if element.text:
                output += cross_joiner + element.text
            else:
                output += cross_joiner
            if element.tail:
                output += cross_joiner + element.tail
        else:
            print('unknown tag in smart_itertext:', element.tag)
    return output
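One direction I have considered, but am not sure about, is walking the tree recursively so that a child's .tail is only emitted after the child's whole subtree. The sketch below is a rough illustration and not necessarily the proper way. It assumes xml.etree.ElementTree and a well-formed version of the fragment (with <br/>), it treats every tag outside cross_join as "join with nothing", and it strips whitespace and drops empty segments. The names CROSS_JOIN, collect_pieces and smart_itertext_recursive are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags that should introduce a joiner; everything else is concatenated as-is
# (an assumption: the original function reports unknown tags instead).
CROSS_JOIN = {'td', 'tr', 'br', 'p'}


def collect_pieces(element, pieces):
    """Append text fragments in reading order; None marks a joiner boundary."""
    is_boundary = element.tag in CROSS_JOIN
    if is_boundary:
        pieces.append(None)            # boundary where the element opens
    if element.text:
        pieces.append(element.text)
    for child in element:
        collect_pieces(child, pieces)  # the child's whole subtree first...
        if child.tail:
            pieces.append(child.tail)  # ...then the text that follows it
    if is_boundary:
        pieces.append(None)            # boundary where the element closes


def smart_itertext_recursive(tree, cross_joiner='::'):
    pieces = []
    collect_pieces(tree, pieces)
    segments, current = [], ''
    for piece in pieces + [None]:      # trailing None flushes the last segment
        if piece is None:
            if current.strip():
                segments.append(current.strip())
            current = ''
        else:
            current += piece
    return cross_joiner.join(segments)


xml_str = ('<td><p>This is how<br/>we<br/><strong>par<sup>s</sup>e</strong>'
           '<br/>our string</p> together</td>')
result = smart_itertext_recursive(ET.fromstring(xml_str))
print(result)  # This is how::we::parse::our string::together
```

Because the tail is read from the parent's loop over its children, the root element's own .tail is never included. lxml elements expose the same .tag, .text, .tail and child iteration, but I have not checked what this returns on the tree that lxml.html.fromstring(html_str) builds.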
What's the proper way to accomplish this?