Extracting annotations from GATE in XML

Question

How can I retrieve the annotated text from a document in a structured way, as shown below? I am using the sentence as the unit of processing, meaning that I would like to retrieve specific pieces of text from each sentence and put them together later. I have already set up my annotations in GATE and saved the annotated results as inline XML.

My input XML file looks like this:

    <Document>
        <Paragraph>
            <text id="100">30.03. Zeraua joins the Otjimbingwe and Omaruru Ovaherero at Samuel’s station at Ongandjira in the upper Swakop valley.</text>
            <text id="101">01.04. Von Glasenapp’s unit proceeds in the direction of Otjikuoko without meeting the Tjetjo community.</text>
            <text id="102">09.04. The battle of Ongandjira is fought with heavy losses on both sides. The Ovaherero have to give way before a sustained German artillery bombardment commences, and they escape in a northerly direction.</text>
        </Paragraph>
        <Paragraph>
            <text id="200">30.03. Zeraua joins the Otjimbingwe and Omaruru Ovaherero at Samuel’s station at Ongandjira in the upper Swakop valley.</text>
            <text id="201">01.04. Von Glasenapp’s unit proceeds in the direction of Otjikuoko without meeting the Tjetjo community.</text>
            <text id="202">09.04. The battle of Ongandjira is fought with heavy losses on both sides. The Ovaherero have to give way before a sustained German artillery bombardment commences, and they escape in a northerly direction.</text>
        </Paragraph>
    </Document>

This is the output structure I want for each sentence:

    <text id="100">
        <Event>Battle of Ongandjira</Event>
        <Location>Ongandjira</Location>
        <NumberDate>30.03</NumberDate>
        <Person>Zeraua</Person>
    </text>

These are my annotations in GATE: [screenshot]

My inline file just contains a lot of mixed-up annotations, and I can't figure out how to structure it in that order. I have also tried the Format_Twitter JSON, and it's a mess too.

Thanks a lot.

score 1 · Answer 1 · answered Aug 07 '17 at 13:42


If I understood your requirements correctly, you could use the following approach (an abstract description of the Java code):

1) Load your annotated document.

2) In your Java code, get all annotations of type Sentence in document order.

3) Loop over the Sentence annotations and get the Event, Location, NumberDate, and Person annotations within every Sentence span.

4) For every annotation (Event, Location, NumberDate, Person), get its text.

5) Create your XML.
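Steps 2 to 5 can be sketched in plain Java as follows. This is only an illustration of the grouping logic and does not use the GATE API itself: `Ann` is a hypothetical stand-in for a GATE annotation (a type plus character offsets), its `id` field is an assumed way of carrying the `id` of the original `<text>` element, and `SentenceGrouping.toXml` is a made-up helper name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a GATE annotation: type plus character offsets.
final class Ann {
    final String type;
    final int start;  // inclusive offset into the document text
    final int end;    // exclusive offset into the document text
    final String id;  // assumed: id of the sentence, null for entities

    Ann(String type, int start, int end, String id) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.id = id;
    }
}

final class SentenceGrouping {
    // Escape the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Emit one <text> element per Sentence, listing the entities in its span.
    static String toXml(String text, List<Ann> anns, List<String> entityTypes) {
        // Step 2: put all annotations in document order.
        List<Ann> sorted = new ArrayList<>(anns);
        sorted.sort(Comparator.comparingInt((Ann a) -> a.start));

        StringBuilder out = new StringBuilder();
        for (Ann s : sorted) {
            if (!s.type.equals("Sentence")) continue;
            out.append("<text id=\"").append(escape(s.id)).append("\">\n");
            // Step 3: for each wanted type, find annotations inside the sentence.
            for (String type : entityTypes) {
                for (Ann e : sorted) {
                    boolean inside = e.start >= s.start && e.end <= s.end;
                    if (e.type.equals(type) && inside) {
                        // Step 4: the covered text is a substring of the document.
                        String covered = text.substring(e.start, e.end);
                        // Step 5: write the element.
                        out.append("    <").append(type).append('>')
                           .append(escape(covered))
                           .append("</").append(type).append(">\n");
                    }
                }
            }
            out.append("</text>\n");
        }
        return out.toString();
    }
}
```

In GATE Embedded itself, the helper class `gate.Utils` should cover the same ground: as far as I know, `inDocumentOrder` sorts an annotation set, `getContainedAnnotations` returns the annotations inside a given span, and `stringFor` returns the text an annotation covers. Check the Javadoc of your GATE version before relying on the exact signatures.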


ashingel


Oh yes, I figured that out. But the problem is that I can't even interpret the XML from GATE. It does not contain the sentences anymore; it just has nodes all over. – Nampa Gwakondo Aug 08 '17 at 15:37
