I have an annotated dataset in Knowtator XML format (txt.knowtator.xml):

<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <annotator id="01">Unknown</annotator>
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
        <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <annotator id="01">Unknown</annotator>
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
        <creationDate>Wed Mar 11 09:55:11 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>

I need to convert it into BRAT standoff format (.ann), for example:

T1    Treatment 127 137    Omeprazole
T2    Treatment 600 612    Tegretol

Is there an existing tool for parsing and converting this format?

torakxkz

1 Answer

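You can parse the XML with xml.etree.ElementTree from the standard library and print the first annotation as a BRAT line: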

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <annotator id="01">Unknown</annotator>
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
        <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
</annotations>'''

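root = ET.fromstring(xml)
# find() returns only the first match, so this handles a single annotation.
span = root.find(".//span")
text = root.find(".//spannedText").text
# The "T1" id and the "Treatment" label are hardcoded here.
print(f'T1    Treatment {span.attrib["start"]} {span.attrib["end"]} {text}')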

Output:

T1    Treatment 127 237 Omeprazole
balderman
  • Thank you, that works. The "Treatment" label can be collected with root.find(".//mentionClass").attrib["id"]. However, it only works for the first annotation. I have edited my question. – torakxkz Apr 12 '22 at 12:41
  • @torakxkz I have answered your question as you posted it. Please accept the answer. If you have a new question - create a new post. – balderman Apr 12 '22 at 12:47
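
For the multi-annotation case raised in the comments, here is a minimal sketch of one possible approach; it is not part of the accepted answer. It assumes every <annotation> has a <mention id="..."> that matches a <classMention id="..."> elsewhere in the file, and that the XML is well-formed. The function name knowtator_to_brat, the SAMPLE string, and the "Unknown" fallback label are hypothetical choices made for illustration. Offsets are copied verbatim from the <span> element, and fields are tab-separated as in BRAT text-bound annotations.

```python
import xml.etree.ElementTree as ET

# Two annotations in the shape shown in the question (values copied as given).
SAMPLE = '''<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>'''


def knowtator_to_brat(xml_text):
    """Return one BRAT text-bound line per <annotation> in the XML string."""
    root = ET.fromstring(xml_text)

    # Map each mention id to its class label,
    # e.g. "EHOST_Instance_93" -> "Treatment".
    labels = {}
    for class_mention in root.iter("classMention"):
        mention_class = class_mention.find("mentionClass")
        if mention_class is not None:
            labels[class_mention.get("id")] = mention_class.get("id")

    lines = []
    for annotation in root.iter("annotation"):
        mention = annotation.find("mention")
        span = annotation.find("span")
        if mention is None or span is None:
            continue  # skip annotations that lack a mention or a span
        text = annotation.findtext("spannedText", default="")
        # Assumed fallback label when no matching classMention exists.
        label = labels.get(mention.get("id"), "Unknown")
        # BRAT line layout: ID <TAB> TYPE START END <TAB> TEXT
        lines.append(
            f"T{len(lines) + 1}\t{label} {span.get('start')} {span.get('end')}\t{text}"
        )
    return lines


for line in knowtator_to_brat(SAMPLE):
    print(line)
```

To convert a file instead of a string, the lines could be written out with something like open("file.ann", "w").write("\n".join(lines) + "\n"), where the output file name is again only an example.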