Export text from xml with self-closing tag

Question

I have a set of TEI XML files containing transcriptions of documents. I would like to parse these XML files and extract only the text.

My XML looks like:

<?xml version='1.0' encoding='UTF8'?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <ab>
        <pb n="page1"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
            <lb xml:id="DD3" n="3"/>my sentence 3
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 4
            <lb xml:id="DD2" n="2"/>my sentence 5
        <pb n="page2"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 3
            <lb xml:id="DD1" n="2"/>my sentence 4
      </ab>
    </body>
  </text>
</TEI>

I have tried to access the information with lxml, by doing:

with open(file,'r') as my_file:
    
    root = ET.parse(my_file, parser = ET.XMLParser(encoding = 'utf-8'))
    list_pages = root.findall('.//{http://www.tei-c.org/ns/1.0}pb')
    for page in list_pages:
        liste_text = page.findall('.//{http://www.tei-c.org/ns/1.0}lb')
    
    final_text = []
    
    for content in liste_text:
        final_text.append(content.text)

I would like to have at the end something like:

page1
my sentence 1
my sentence 2
my sentence 3
my sentence 4
my sentence 5
page2
my sentence 1
my sentence 2
my sentence 3
my sentence 4

I manage to access the lb elements, but no text is attached to them. Could you please help me extract this information? Thanks

score 1 · Accepted Answer · answered Feb 19 '23 at 23:31

Note that your XML may have a problem: several xml:id attributes share the same value. An xml:id value is required to be unique within the XML document (this is a constraint of the xml:id specification rather than of well-formedness).
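A quick way to spot such duplicates is sketched below. It uses only the standard library; the helper name duplicate_xml_ids and the shortened sample are made up for illustration and are not part of the question.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace URI
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_xml_ids(root):
    """Return the xml:id values that occur more than once, sorted."""
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

# Shortened, hypothetical sample in the shape of the question's file
id_sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><ab>
  <lb xml:id="DD1" n="1"/>my sentence 1
  <lb xml:id="DD2" n="2"/>my sentence 2
  <lb xml:id="DD1" n="1"/>my sentence 4
</ab></body></text></TEI>"""

duplicates = duplicate_xml_ids(ET.fromstring(id_sample))
print(duplicates)
```

Any value reported here would need to be renamed (or the attribute dropped) before the ids can be relied on.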

Assuming that's fixed, this is easier to do if you use lxml instead of ElementTree, because of lxml's better XPath support:

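from lxml import etree

# my_file is assumed here to be the path to the TEI file
root = etree.parse(my_file)
# name() matches on the element name, so the TEI namespace need not be spelled out
for p in root.xpath('//*[name()="pb"]'):
    print(p.xpath('./@n')[0].strip())
    # walk the siblings that follow this page break, skipping column breaks
    for lb in p.xpath('./following-sibling::*[not(name()="cb")]'):
        if lb.xpath('name()') == "pb":
            break  # the next page starts here
        # text after a self-closing tag is stored in .tail, not .text
        print((lb.tail or '').strip())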

The output should match your expected output.
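For comparison, the same idea can be sketched without XPath, using only the standard-library ElementTree. The key point is that the text following a self-closing tag such as lb is stored in that element's .tail, while its .text is None, which is why the attempt in the question finds no text. The function name text_by_page and the shortened sample (with made-up unique ids) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def text_by_page(root):
    """Map each <pb n="..."> value to the list of line texts that follow it."""
    pages = {}
    current = None
    # iter() visits elements in document order
    for el in root.iter():
        if el.tag == TEI + "pb":
            # a page break starts a new page
            current = pages.setdefault(el.get("n"), [])
        elif el.tag == TEI + "lb" and current is not None:
            # the sentence sits after the self-closing tag, i.e. in .tail
            line = (el.tail or "").strip()
            if line:
                current.append(line)
    return pages

# Shortened, hypothetical sample in the shape of the question's file
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><ab>
  <pb n="page1"/>
    <cb n="1"/>
      <lb xml:id="A1" n="1"/>my sentence 1
      <lb xml:id="A2" n="2"/>my sentence 2
    <cb n="2"/>
      <lb xml:id="A3" n="1"/>my sentence 3
  <pb n="page2"/>
    <cb n="1"/>
      <lb xml:id="B1" n="1"/>my sentence 1
</ab></body></text></TEI>"""

pages = text_by_page(ET.fromstring(sample))
for page, lines in pages.items():
    print(page)
    print("\n".join(lines))
```

Collecting the lines in a dict rather than printing them directly keeps the page grouping available for later processing; cb elements are simply ignored because only pb and lb are tested.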
