Parsing the TextWithNodes element of a GATE document in Python

Question

On an NLP project, I need to process the text and annotations in the TextWithNodes element of a GATE XML document. This element has the following appearance:

<TextWithNodes>
Some kind of sample <Node id="123" /> text 
in which <Node id="234" /> various Node 
elements appear.
</TextWithNodes>

The Node elements then correlate by ID attribute with annotations later in the same file. This is apparently standard GATE syntax. However, using the xml.etree.ElementTree module in Python, I cannot seem to capture the whole content of the TextWithNodes element. If I enter this--

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('my_file.xml')
>>> twn = tree.find('TextWithNodes')
>>> twn.text
Some kind of sample

That is, I only get the first fragment of text before the first Node element. How can I get the whole chunk of text, with the Node elements embedded in it? Or is there a better way of going about this? I want, eventually, to turn the whole text content into a list of sentences, each element of which has the text of the sentence paired with a dictionary from Node ID to the text content of the corresponding annotation--something like that. Thanks.

The missing text pieces are stored in the `tail` property of the `Node` elements. See https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.tail. See also https://stackoverflow.com/q/3683997/407651, https://stackoverflow.com/q/66182075/407651 (and many other questions). — mzjn, Jul 07 '21 at 05:08
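
To illustrate the comment above, here is a minimal sketch of how the pieces might be reassembled from `text` and `tail`. The inline SAMPLE document, the GateDocument wrapper element, and the helper name text_and_offsets are assumptions made for the example, not part of the question's actual file. The helper joins the leading text with each Node's tail and records the character offset at which each Node appears in the rebuilt string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modeled on the snippet in the question.
SAMPLE = """<GateDocument>
<TextWithNodes>Some kind of sample <Node id="123" /> text
in which <Node id="234" /> various Node
elements appear.</TextWithNodes>
</GateDocument>"""


def text_and_offsets(twn):
    """Return (full_text, {node_id: offset}) for a TextWithNodes element.

    Hypothetical helper: the offset is the position in full_text
    at which the Node element occurred.
    """
    # Text before the first child element lives in .text
    parts = [twn.text or ""]
    offsets = {}
    position = len(parts[0])
    for node in twn:
        # Record where this Node sits in the reconstructed text
        offsets[node.get("id")] = position
        # Text between this Node and the next element lives in .tail
        tail = node.tail or ""
        parts.append(tail)
        position += len(tail)
    return "".join(parts), offsets


root = ET.fromstring(SAMPLE)
twn = root.find("TextWithNodes")
full_text, offsets = text_and_offsets(twn)
print(repr(full_text))
print(offsets)
```

If only the text is needed and the Node positions are not, `"".join(twn.itertext())` should give the same string. The offsets dictionary could then be used to slice full_text between the start and end Nodes that an annotation refers to, assuming the annotations identify their spans by Node ID as the question describes.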
