I have a set of TEI XML files containing transcriptions of documents. I would like to parse these XML files and extract only the text.
My XML looks like this:
<?xml version='1.0' encoding='UTF8'?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <ab>
        <pb n="page1"/>
        <cb n="1"/>
        <lb xml:id="DD1" n="1"/>my sentence 1
        <lb xml:id="DD2" n="2"/>my sentence 2
        <lb xml:id="DD3" n="3"/>my sentence 3
        <cb n="2"/>
        <lb xml:id="DD1" n="1"/>my sentence 4
        <lb xml:id="DD2" n="2"/>my sentence 5
        <pb n="page2"/>
        <cb n="1"/>
        <lb xml:id="DD1" n="1"/>my sentence 1
        <lb xml:id="DD2" n="2"/>my sentence 2
        <cb n="2"/>
        <lb xml:id="DD1" n="1"/>my sentence 3
        <lb xml:id="DD1" n="2"/>my sentence 4
      </ab>
    </body>
  </text>
</TEI>
I tried to access the text with lxml, like this:
from lxml import etree as ET

with open(file, 'r') as my_file:
    root = ET.parse(my_file, parser=ET.XMLParser(encoding='utf-8'))

# Find every page break, then look for line breaks "inside" each one
list_pages = root.findall('.//{http://www.tei-c.org/ns/1.0}pb')
for page in list_pages:
    liste_text = page.findall('.//{http://www.tei-c.org/ns/1.0}lb')
    final_text = []
    for content in liste_text:
        final_text.append(content.text)
In the end, I would like to get something like this:
page1
my sentence 1
my sentence 2
my sentence 3
my sentence 4
my sentence 5
page2
my sentence 1
my sentence 2
my sentence 3
my sentence 4
Even when I manage to access the lb elements, no text is attached to them. Could you please help me extract this text? Thanks
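For reference, here is a minimal sketch of one possible approach, not a definitive solution. In both lxml and the standard library's ElementTree, the text that follows an empty element such as `<lb/>` is stored in that element's `.tail` attribute, not in `.text`, which would explain why `content.text` is empty. Also, the `lb` elements are siblings of `pb`, not descendants, so `page.findall(...)` on a `pb` element returns nothing. The sketch walks the tree once in document order and groups lines under the most recent `pb`. The `SAMPLE` string and the `extract_pages` helper are hypothetical names made up for illustration, and the sample is a shortened version of the file shown in the question.

```python
try:
    from lxml import etree as ET  # the library used in the question
except ImportError:
    # Standard-library fallback; it exposes the same .tail attribute
    import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Shortened, hypothetical sample with the same shape as the real file
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><ab>
<pb n="page1"/>
<cb n="1"/>
<lb n="1"/>my sentence 1
<lb n="2"/>my sentence 2
<cb n="2"/>
<lb n="1"/>my sentence 3
<pb n="page2"/>
<cb n="1"/>
<lb n="1"/>my sentence 1
</ab></body></text></TEI>"""


def extract_pages(root):
    """Return a list of (page_name, [lines]) in document order."""
    pages = []
    for elem in root.iter():
        if elem.tag == TEI_NS + "pb":
            # A page break starts a new group of lines
            pages.append((elem.get("n"), []))
        elif elem.tag == TEI_NS + "lb" and pages:
            # The sentence after <lb/> lives in .tail, not .text
            line = (elem.tail or "").strip()
            if line:
                pages[-1][1].append(line)
    return pages


root = ET.fromstring(SAMPLE)
pages = extract_pages(root)
for name, lines in pages:
    print(name)
    for line in lines:
        print(line)
```

For a file on disk, `ET.parse(path).getroot()` could be passed to `extract_pages` instead of the parsed sample string. Column breaks (`cb`) are simply skipped here, so the lines of both columns end up under the same page, as in the desired output.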