On an NLP project, I need to process the text and annotations in the TextWithNodes element of a GATE XML document. This element has the following appearance:
<TextWithNodes>
Some kind of sample <Node id="123" /> text
in which <Node id="234" /> various Node
elements appear.
</TextWithNodes>
The Node elements then correlate by ID attribute with annotations later in the same file. This is apparently standard GATE syntax. However, using the xml.etree.ElementTree module in Python, I cannot seem to capture the whole content of the TextWithNodes element. If I enter this--
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('my_file.xml')
>>> twn = tree.find('TextWithNodes')
>>> twn.text
Some kind of sample
That is, I only get the first fragment of text, up to the first Node element. How can I get the whole chunk of text, with the Node elements embedded in it? Or is there a better way of going about this? Eventually, I want to turn the whole text content into a list of sentences, where each entry pairs the text of the sentence with a dictionary mapping Node IDs to the text content of the corresponding annotations, or something like that. Thanks.
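For what it's worth, here is a minimal sketch of one direction that might work, not a confirmed solution. It relies on the fact that ElementTree stores the text following a child element in that child's tail attribute rather than in the parent's text. The GateDocument wrapper, the SAMPLE string, and the text_with_offsets helper are hypothetical names made up for illustration; a real GATE file will contain more than this around the TextWithNodes element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: the sample from the question,
# wrapped in an assumed root element so that find() has something to search.
SAMPLE = """<GateDocument>
<TextWithNodes>
Some kind of sample <Node id="123" /> text
in which <Node id="234" /> various Node
elements appear.
</TextWithNodes>
</GateDocument>"""


def text_with_offsets(twn):
    """Return (full_text, offsets), where offsets maps each Node id to the
    character position in full_text at which that Node occurs."""
    # twn.text holds only the text before the first child element.
    parts = [twn.text or ""]
    offsets = {}
    position = len(parts[0])
    for node in twn.iter("Node"):
        offsets[node.get("id")] = position
        # The text that follows a Node is stored in that Node's tail.
        tail = node.tail or ""
        parts.append(tail)
        position += len(tail)
    return "".join(parts), offsets


root = ET.fromstring(SAMPLE)
twn = root.find("TextWithNodes")
full_text, offsets = text_with_offsets(twn)
```

Here full_text should be the same string as "".join(twn.itertext()), and offsets records where each Node sits in that string, so the positions could later be matched against the annotations by ID.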