
I want to ignore annotation tags when parsing text with SyntaxNet.

For example, in the sentence below, I want to ignore the <X> and </X> annotation tags.

<PERSON>Michael Jordan</PERSON> is a professor at <LOC>Berkeley</LOC>.

So I expect the following output:

_    <PERSON>    _     ...
1    Michael     _     ...
2    Jordan      _     ...
_    </PERSON>   _     ...
3    is          _     ...
...

Does SyntaxNet have a feature like this?

techraf
mayo

1 Answer


No, SyntaxNet does not have a specific feature for handling XML tags. However, you can easily preprocess your data in Python with something like:

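import xml.etree.ElementTree as ET

# Wrap the sentence in a single root element so it parses as well-formed XML.
tree = ET.fromstring(
    "<DOC><PERSON>Michael Jordan</PERSON> is a "
    "professor at <LOC>Berkeley</LOC>.</DOC>")
# method='text' drops the tags; encoding='unicode' returns a str, not bytes.
notags = ET.tostring(tree, encoding='unicode', method='text')
print(notags)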

See also Python strip XML tags from document.
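
If the annotations need to be kept so they can be lined up with the parser output later, one possible variant is to record the character offsets of each tagged span while stripping the tags. The sketch below is only an illustration: the helper name strip_tags_with_spans and the (tag, start, end) span format are made up for this example and are not part of SyntaxNet or ElementTree. It assumes the input is well-formed XML wrapped in a single root element, as in the snippet above.

```python
import xml.etree.ElementTree as ET


def strip_tags_with_spans(xml_string):
    """Return (plain_text, spans); each span is (tag, start, end) in characters.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_string)
    parts = []   # text fragments in document order
    spans = []   # (tag, start, end) for every element except the root
    pos = 0      # length of the plain text collected so far

    def visit(elem, is_root):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            visit(child, False)
            # Text after a child's closing tag is stored in child.tail
            # and belongs to the parent element.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        if not is_root:
            spans.append((elem.tag, start, pos))

    visit(root, True)
    return "".join(parts), spans


text, spans = strip_tags_with_spans(
    "<DOC><PERSON>Michael Jordan</PERSON> is a "
    "professor at <LOC>Berkeley</LOC>.</DOC>")
print(text)
for tag, start, end in spans:
    print(tag, start, end, text[start:end])
```

The plain text can then be passed to the parser, and the recorded offsets compared with the character positions of the tokens it returns, assuming those positions can be recovered (for example by searching for each token in the plain text in order).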

Community
calberti
  • Thanks. But I think that if I remove the XML tags from the text, it becomes hard to merge the SyntaxNet outputs with the XML tags. I want to use both the SyntaxNet outputs and the XML annotation information as features for another machine learning model. – mayo Sep 01 '16 at 02:23