I have a relatively complex XML encoding in which the text can contain an arbitrary number of nested elements. Take this simplified example:
<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>
I want to enclose every named entity in the sentence (both instances of James, plus Peter) in an rs element:
<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>
To simplify this, let's say I have a list of names I could find in the text, such as:
names = ["James", "Peter", "Mary"]
I want to use lxml for this. I know I could use etree.SubElement() and append a new element at the end of the p element, but I don't know how to deal with the tails and the other possible elements.
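To make the text/tail question concrete, here is a small sketch that parses the example with lxml and lists each element's text and tail. It assumes the example is held in a string without the line breaks; the variable names xml and parts are only for illustration.

```python
from lxml import etree

# The simplified example from above, written as one string (no line breaks).
xml = (
    "<div>"
    "<p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p>"
    "</div>"
)
div = etree.fromstring(xml)

# Collect (tag, text, tail) for every element in document order.
parts = [(el.tag, el.text, el.tail) for el in div.iter()]
for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))
```

In this listing the first James shows up in the text of p, the second James in the tail of stage, and Peter in the text of the second hi.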
I understand that I need to handle the three references in my example differently.
- The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"
Right?
- The second James is in the tail of the stage element. I don't know how to deal with that.
- The reference to Peter is in the text of the hi element. I guess I have to iterate through all elements, look at both the text and the tail of each one, and search for the named entities in my list.
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
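The iteration idea could be sketched roughly as follows. This is only one possible approach, under some assumptions: names are matched as whole words with a regular expression, and wrap_names and make_rs_elements are hypothetical helper names, not part of lxml. Each matching text or tail is split around the names, and every name becomes a real rs element whose tail carries the text that followed it.

```python
import re
from lxml import etree

names = ["James", "Peter", "Mary"]
# One capturing group, so re.split() keeps the matched names in its result.
NAME_RE = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")


def make_rs_elements(pieces):
    """Turn [name, following, name, following, ...] into rs elements."""
    new = []
    for name, following in zip(pieces[0::2], pieces[1::2]):
        rs = etree.Element("rs")
        rs.text = name
        rs.tail = following or None  # text after the name travels as the tail
        new.append(rs)
    return new


def wrap_names(root):
    # Snapshot the tree first so the new rs elements are not visited again.
    for el in list(root.iter()):
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        # Names in the element's own text become its first children.
        if el.tag != "rs" and el.text:
            pieces = NAME_RE.split(el.text)
            el.text = pieces[0] or None
            for offset, rs in enumerate(make_rs_elements(pieces[1:])):
                el.insert(offset, rs)
        # Names in the tail become siblings placed right after the element.
        parent = el.getparent()
        if parent is not None and el.tail:
            pieces = NAME_RE.split(el.tail)
            el.tail = pieces[0] or None
            pos = parent.index(el) + 1
            for offset, rs in enumerate(make_rs_elements(pieces[1:])):
                parent.insert(pos + offset, rs)
    return root


xml = (
    "<div>"
    "<p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p>"
    "</div>"
)
div = wrap_names(etree.fromstring(xml))
result = etree.tostring(div, encoding="unicode")
print(result)
```

The sketch builds real elements instead of writing markup into a string because a string assigned to .text is stored as literal characters: on serialization, "<rs>" in a text value comes out escaped as &lt;rs&gt; and not as a tag.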
My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!