
For example, if I have words like MKIK or "牛逼" (an artificially created token), how can we tell a neural network (a transformer model) to copy them unchanged into the output?

The problem occurs when using the transformer model in fairseq.

I found that fairseq has a --replace-unk parameter, but it does not seem to work with the transformer model, or it has a bug.

xihajun

2 Answers


I have an idea myself: pretrain a naive model with all of the unknown tokens included (e.g., the Chinese characters), then fine-tune the model without those unknown tokens.

I guess that in this way the corresponding network connections will not be updated?

But I will have to play around with the structure and see.
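As a sanity check on that intuition, here is a minimal PyTorch sketch (not the actual fairseq setup; the toy vocabulary size and token ids are assumptions) showing that embedding rows for token ids that never appear in the fine-tuning batches receive zero gradient, so plain SGD leaves those rows unchanged. Deeper layers and optimizers with weight decay can still move, of course.

    import torch
    import torch.nn as nn

    # Toy embedding table standing in for a pretrained model's token embeddings.
    vocab_size, dim = 10, 4
    emb = nn.Embedding(vocab_size, dim)

    # A fine-tuning batch that never contains token id 7 (our "unknown" token).
    batch = torch.tensor([[1, 2, 3]])
    loss = emb(batch).sum()
    loss.backward()

    print(emb.weight.grad[7])  # all zeros: the unseen row gets no gradient
    print(emb.weight.grad[2])  # nonzero: rows for seen tokens do get updated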

xihajun

It seems there is a bug in fairseq (see a GitHub issue) where the --replace-unk option fails if the source-side token is also <unk>.

If you train your models from scratch, a workaround might be to use a tokenizer that never produces <unk>s, such as SentencePiece with byte fallback, which splits unknown tokens down to the byte level when necessary, and all bytes are in the vocabulary.
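For illustration, here is a small sketch of how such a tokenizer could be trained with the sentencepiece Python package (the file names and vocabulary size are placeholders); the resulting subword text can then be binarized with fairseq-preprocess as usual.

    import sentencepiece as spm

    # Train a BPE model with byte fallback: strings the learned vocabulary
    # cannot cover are decomposed into byte pieces, so <unk> is never produced.
    spm.SentencePieceTrainer.train(
        input="train.txt",          # placeholder path to raw training text
        model_prefix="spm_bpe",
        vocab_size=8000,            # placeholder size
        model_type="bpe",
        byte_fallback=True,
    )

    sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
    # Unknown strings are segmented into byte pieces instead of <unk>.
    print(sp.encode("MKIK 牛逼", out_type=str))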

Jindřich