
I am new to GATE. I am trying to analyse the performance of different tools on a wide range of corpora. The problem is that the annotation diff tool and the Corpus QA tool require the annotation names to be identical, including case. However, each system has its own schema and generates different labels. For example, organisation in one system is Org in the other.

Is there a way to normalise these schemas to be able to compare between different systems?

Juan Carlos Farah
  • Good question :) For these purposes I usually use the groovy console to merge corpora, rename annotations, add annotation sets, etc. – Yasen Feb 27 '15 at 08:54

1 Answer


In such cases (renaming, adding empty annotation sets, ...) I recommend working on the exported XML of a corpus:

Right-click on the corpus -> Save as ... -> GATE XML

If you look at the exported files, you will see the annotation sets at the end of each file (after your actual data), like this:

... data ...
</TextWithNodes>

<AnnotationSet Name="myAnnotationSet">
  <Annotation Id="1" Type="AnnotationName" StartNode="11" EndNode="111">
    <Feature>
      <Name className="java.lang.String">feature-key</Name>
      <Value className="java.lang.String">feature-value</Value>
    </Feature>
    ...
  </Annotation>
  ...
</AnnotationSet>
...

Simply replace whatever you need e.g. with

find . -name '*.xml' -exec sed -i 's/>feature-key</>new-key</g' "{}" \;

(assuming that the phrase >feature-key< appears nowhere else in the document) or with your favourite text editor, and then re-import the corpus:

Right-click on an (empty) corpus -> Populate
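
For the case in the question (different labels for the same annotation type), the same search-and-replace idea can be applied to the Type attribute instead of a feature name. The following is only a sketch: the function name normalise_types is made up, the target label Organisation is an arbitrary choice, and it assumes that each <Annotation ...> start tag sits on a single line, as in the excerpt above.

```shell
# Hypothetical helper: rewrites annotation type labels in GATE XML read
# from stdin and writes the result to stdout.
# The pattern is anchored on '<Annotation ' so that the same phrase
# appearing in the document text is left alone.
normalise_types() {
  sed -e 's/\(<Annotation [^>]*Type="\)Org"/\1Organisation"/g' \
      -e 's/\(<Annotation [^>]*Type="\)organisation"/\1Organisation"/g'
}

# Example use on one exported file (file names are placeholders):
#   normalise_types < exported.xml > normalised.xml
```

Add one -e expression per label that needs mapping. Writing to a new file rather than editing in place keeps the original export available in case a pattern matches more than intended.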
thorsten
  • I overlooked the comment on your question by @Yasen, which is also a pretty groovy way of course :). – thorsten Feb 27 '15 at 16:58