How to ignore annotation characters on SyntaxNet?

Question

I want to ignore annotation characters when parsing text with SyntaxNet.

For example, in the case below, I want to ignore <X> and </X> annotation characters.

<PERSON>Michael Jordan</PERSON> is a professor at <LOC>Berkeley</LOC>.

So I expect the following output.

_    <PERSON>    _     ...
1    Michael     _     ...
2    Jordan      _     ...
_    </PERSON>   _     ...
3    is          _     ...
...

Does SyntaxNet have this kind of feature?

score 0 · Answer 1 · edited May 23 '17 at 12:32


No, SyntaxNet does not have specific features for handling XML tags. However, you can easily preprocess your data in Python with something like:

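import xml.etree.ElementTree as ET

# Wrap the annotated sentence in a single root element so it parses as XML.
tree = ET.fromstring(
    "<DOC><PERSON>Michael Jordan</PERSON> is a "
    "professor at <LOC>Berkeley</LOC>.</DOC>")
# method='text' drops the tags; encoding='unicode' returns a str in Python 3
# (encoding='utf8' would return bytes and print as b'...').
notags = ET.tostring(tree, encoding='unicode', method='text')
print(notags)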

See also Python strip XML tags from document.

edited May 23 '17 at 12:32 by Community

answered Aug 31 '16 at 20:12 by calberti

Thanks. But I think that if I remove the XML tags from the text, it will be hard to merge the SyntaxNet outputs with the XML tags. I want to use the SyntaxNet outputs and the XML annotation information as features for another machine learning model. – mayo Sep 01 '16 at 02:23
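
One possible way around the merging problem raised in the comment above, as a rough sketch rather than anything SyntaxNet provides: strip the tags yourself, but record where each annotated span falls in the stripped text, so the annotations can be matched against the parser output afterwards. The helper name `strip_with_spans` and the choice of character offsets are assumptions made for illustration. How the offsets are aligned with the parser's tokens is left open, since that depends on how the parser output is produced.

```python
import xml.etree.ElementTree as ET


def strip_with_spans(xml_text):
    """Return (plain_text, spans) for an annotated document.

    Hypothetical helper: spans is a list of (tag, start, end) tuples,
    where start/end are character offsets into plain_text. The root
    element is treated as a wrapper and is not reported as a span.
    """
    root = ET.fromstring(xml_text)
    parts = []   # text fragments in document order
    spans = []   # (tag, start, end) for every non-root element
    pos = 0      # current length of the stripped text

    def emit(text):
        nonlocal pos
        if text:
            parts.append(text)
            pos += len(text)

    def walk(elem, is_root):
        start = pos
        emit(elem.text)
        for child in elem:
            walk(child, False)
            emit(child.tail)  # text between this child and the next one
        if not is_root:
            spans.append((elem.tag, start, pos))

    walk(root, True)
    return "".join(parts), spans


text, spans = strip_with_spans(
    "<DOC><PERSON>Michael Jordan</PERSON> is a "
    "professor at <LOC>Berkeley</LOC>.</DOC>")
print(text)
for tag, start, end in spans:
    print(tag, start, end, repr(text[start:end]))
```

The stripped `text` can then be fed to the parser, and each `(tag, start, end)` span can be attached to whichever tokens fall inside that character range, assuming you can recover token positions in the input text.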
