
We're trying to use SyntaxNet on English, Italian, and Spanish with the models pre-trained on Universal Dependencies datasets, which we found here: https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md.
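For reference, this is how we fetched the models; the URL pattern below is the one given on that page (Italian shown, and the other languages work the same way; adjust the target directory to match MODEL_DIRECTORY below):

# Download and unpack the pre-trained Parsey Universal model for Italian
# (URL pattern as listed on the universal.md page linked above).
wget http://download.tensorflow.org/models/parsey_universal/Italian.zip
unzip Italian.zip -d ../pretrained/Italian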

For Italian and Spanish we are running into problems at the tokenisation level with contractions and clitics. A contraction combines a preposition and a determiner, so we want it to be split into its two parts. We have noticed that the tokeniser consistently fails to do so, which makes the analysis of the whole sentence wrong. The same happens for clitics. (A rule-based pre-splitting workaround we have been experimenting with is sketched after the examples below.)

We are launching the models as follows:

MODEL_DIRECTORY=../pretrained/Italian
cat /mnt/test_ita.split | syntaxnet/models/parsey_universal/tokenize.sh \
                        $MODEL_DIRECTORY > /mnt/test_ita.tokenized
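
As far as we understand the README linked above, a parse.sh driver sits next to tokenize.sh and takes tokenised text on stdin, so the full analyses shown below are obtained by chaining the two scripts:

MODEL_DIRECTORY=../pretrained/Italian
# Tokenise, then run the tagger/parser over the tokenised stream;
# both scripts take the model directory as their only argument.
cat /mnt/test_ita.split \
    | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY \
    | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY \
    > /mnt/test_ita.parsed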

Below is an example of the output we are currently obtaining, followed by the output we would like to have.

Italian (SyntaxNet analysis)

1       Sarebbe _       VERB    V       Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++V      2       cop     _   _
2       bello   _       ADJ     A       Gender=Masc|Number=Sing|fPOS=ADJ++A     0       ROOT    _       _
3       esserci _       PRON    PE      fPOS=NOUN++S    2       nsubj   _       _
4       .       _       PUNCT   FS      fPOS=PUNCT++FS  2       punct   _       _

Italian (desired output)

1       Sarebbe _       VERB    V       Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++V      2       cop     _   _
2       bello   _       ADJ     A       Gender=Masc|Number=Sing|fPOS=ADJ++A     0       ROOT    _       _
3       esser   _       VERB    V       VerbForm=Inf|fPOS=VERB++V       2       csubj   _       _
4       ci      _       PRON    PC      PronType=Clit|fPOS=PRON++PC     3       advmod  _       _
5       .       _       PUNCT   FS      fPOS=PUNCT++FS  2       punct   _       _
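
As a stopgap we have been experimenting with the pre-splitting workaround mentioned above: rewriting contractions in the raw text before it reaches tokenize.sh. A minimal sketch follows; the contraction table is our own and deliberately tiny, it assumes GNU sed for the \b word boundaries, it only covers lowercase forms, and it loses the original surface form that proper UD output would keep on a multiword-token line.

# pre_split_ita.sh -- our own rule-based workaround, not part of SyntaxNet:
# split Italian articulated prepositions (preposition + determiner) into
# their two components before tokenisation. Clitics such as "esserci"
# attach to an open class of verbs, so a fixed table like this cannot
# cover them; they would need real morphological analysis.
sed -E \
    -e 's/\bdel\b/di il/g'   -e 's/\bdella\b/di la/g' \
    -e 's/\bnel\b/in il/g'   -e 's/\bnella\b/in la/g' \
    -e 's/\bal\b/a il/g'     -e 's/\balla\b/a la/g' \
    -e 's/\bsul\b/su il/g'   -e 's/\bsulla\b/su la/g' \
    /mnt/test_ita.split > /mnt/test_ita.presplit

The pre-split file can then go through tokenize.sh and parse.sh as above, but this feels fragile and still leaves the clitics unsolved.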

How can we handle this problem properly? Thanks in advance.

