
I have a relatively complex XML encoding in which the text can contain an arbitrary number of nested elements. Take this simplified example:

<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>

I want to enclose all named entities in the sentence (both instances of James, and Peter) in an rs element:

<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>

To simplify this, let's say I have a list of names I could find in the text, such as:

names = ["James", "Peter", "Mary"]

I want to use lxml for this. I know I could use etree.SubElement() to append a new element at the end of the p element, but I don't know how to deal with the tails and with any other elements that may be present.

I understand that I need to handle the three references in my example differently.

  1. The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"

Right?

  2. The second James is in the tail of the stage element. I don't know how to deal with that.
  3. The reference to Peter is in the text of the hi element. I guess I have to iterate through all possible elements, look at both the text and the tail of each one, and search for the named entities in my list.
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
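For reference, the three cases above can be checked with a minimal sketch like the following. It only inspects where each string is stored and what happens when markup is assigned to .text as a string. The variable names (locations, demo, escaped) are made up for illustration, and the import falls back to the standard library's xml.etree.ElementTree, which uses the same text/tail model, in case lxml is not installed.

```python
try:
    from lxml import etree  # the library the question is about
except ImportError:  # assumption: stdlib fallback, same text/tail model
    import xml.etree.ElementTree as etree

xml = ("<div><p>-I like James <stage><hi>he said to her </hi></stage>"
       ", but I am not sure James understands "
       "<hi>Peter</hi>'s problems.</p></div>")
div = etree.fromstring(xml)
p = div[0]
stage, hi = p[0], p[1]

# Where each piece of character data is stored
locations = {
    "p.text": p.text,          # text before the first child of <p>
    "stage.tail": stage.tail,  # text after </stage>, still inside <p>
    "hi.text": hi.text,        # text inside the second <hi>
    "hi.tail": hi.tail,        # text after </hi>
}
for where, value in locations.items():
    print(f"{where}: {value!r}")

# Assigning markup as a string does not create an element:
# the angle brackets are escaped on serialization.
demo = etree.Element("p")
demo.text = "-I like <rs>James</rs>"
escaped = etree.tostring(demo, encoding="unicode")
print(escaped)
```

So the second James turns up in stage.tail, and a string containing <rs> assigned to .text stays plain text rather than becoming a child element.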

My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!

José

2 Answers


I know you want to use lxml, but XSLT is custom-made for this sort of thing. In XSLT 3.0,

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0" expand-text="yes">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:param name="names" select="'James', 'Peter', 'Mary'"/>
<xsl:template match="text()">
  <xsl:analyze-string select="." 
                      regex="{string-join($names,'|')}">
     <xsl:matching-substring>
       <rs>{.}</rs>
     </xsl:matching-substring>
     <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
</xsl:transform>
Michael Kay
  • A couple of notes for the OP: Mixed content (a mix of text and element nodes) is not easy to deal with in lxml. XSLT handles this much more easily. Now that Saxon (saxonche) is an easy PyPI install, it wouldn't be very difficult to run Dr Kay's XSLT in Python. – Daniel Haley May 26 '23 at 16:47
  • Thank you so much for your answer, Dr. Kay, I will consider this. – José May 27 '23 at 04:40

It's a little convoluted, but it can be done.

Let's say your XML looks like this:

play = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <div>
      <p>
         -I like James
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure James understands
         <hi>Peter</hi>
         's problems.
      </p>
   </div>
   <div>
      <p>
         -I like Mary
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure Peter understands
         <hi>James</hi>
         's problems.
      </p>
   </div>
</root>
'''

I inserted another div and added formatting for clarity. Note that this assumes that each <div> contains only one <p>; if that's not the case, the code will need further refinement.

from lxml import etree

doc = etree.XML(play.encode())
names = ["James", "Peter", "Mary"]

# find all the divs that need changing
destinations = doc.xpath('//div')

for destination in destinations:
    # extract the string representation of the current <p> (the "target")
    target = destination.xpath('./p')[0]
    target_str = etree.tostring(target).decode()

    # replace the names with the required tag
    for name in names:
        target_str = target_str.replace(name, f'<rs>{name}</rs>')

    # remove the original <p> and replace it with the new one,
    # as an element formed from the new string
    destination.remove(target)
    destination.insert(0, etree.fromstring(target_str))

print(etree.tostring(doc).decode())

In this case, the output should be:

<root>
   <div>
      <p>
         -I like <rs>James</rs>
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure <rs>James</rs> understands
         <hi><rs>Peter</rs></hi>
         's problems.
      </p></div>
   <div>
      <p>
         -I like <rs>Mary</rs>
         <stage>
            <hi>he said to her</hi>
         </stage>
         , but I am not sure <rs>Peter</rs> understands
         <hi><rs>James</rs></hi>
         's problems.
      </p></div>
</root>
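
Two caveats with the string-replace approach: it works on the serialized markup, so a name that also appears in a tag or attribute would be rewritten too, and the rebuilt <p> loses the whitespace tail of the old one (hence `</p></div>` in the output). A tree-based alternative could look roughly like the sketch below. It is not part of lxml; wrap_names and _split are hypothetical helper names, the word-boundary regex is an assumption about what counts as a match, and the code sticks to the subset of the API shared with the standard library's xml.etree.ElementTree so it still runs if lxml is missing.

```python
import re

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback with the same text/tail model
    import xml.etree.ElementTree as etree


def _split(text, pattern, tag):
    """Return (leading_text, new_elements) for one text or tail string."""
    lead, nodes, pos, last = text, [], 0, None
    for match in pattern.finditer(text):
        before = text[pos:match.start()]
        if last is None:
            lead = before        # text that stays where it was
        else:
            last.tail = before   # text between two wrapped names
        last = etree.Element(tag)
        last.text = match.group()
        nodes.append(last)
        pos = match.end()
    if last is not None:
        last.tail = text[pos:]   # remainder after the final name
    return lead, nodes


def wrap_names(root, names, tag="rs"):
    """Wrap each whole-word occurrence of a name in a new <tag>, in place."""
    # Longest names first so a longer name wins over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    # Snapshot the tree first: elements are inserted while we walk it.
    for parent in list(root.iter()):
        if not isinstance(parent.tag, str) or parent.tag == tag:
            continue  # skip comments/PIs and already wrapped names
        children = list(parent)
        if parent.text:
            parent.text, nodes = _split(parent.text, pattern, tag)
            for i, node in enumerate(nodes):
                parent.insert(i, node)
        for child in children:
            if child.tail:
                child.tail, nodes = _split(child.tail, pattern, tag)
                index = list(parent).index(child) + 1
                for offset, node in enumerate(nodes):
                    parent.insert(index + offset, node)
    return root


sample = ("<div><p>-I like James <stage><hi>he said to her </hi></stage>"
          ", but I am not sure James understands "
          "<hi>Peter</hi>'s problems.</p></div>")
tree = wrap_names(etree.fromstring(sample), ["James", "Peter", "Mary"])
result = etree.tostring(tree, encoding="unicode")
print(result)
```

Because each element's text and its children's tails are handled separately, this does not depend on how many <p> elements a <div> contains, and existing tails are left in place. On the sample from the question it should reproduce the markup the OP asked for.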
Jack Fleeting