
For example, if I have words like MKIK or "牛逼" (an artificially created token), how can we tell a neural network (a transformer model) to copy them unchanged into the output?

The problem occurs when using the transformer model in fairseq.

I found that fairseq has a --replace-unk parameter, but it does not seem to work with the transformer model, or it has a bug.

xihajun

2 Answers


I have an idea myself: pretrain a naive model with all of the unknown tokens included (e.g., the Chinese characters), then fine-tune the model without those unknown tokens.

I guess that in this way the corresponding network connections will not be updated?

But I will have to play around with the structure and see.
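As a sanity check on that intuition, here is a minimal PyTorch sketch (not the actual fairseq setup; the toy vocabulary size and token ids are assumptions) showing that embedding rows for token ids that never appear in the fine-tuning batches receive zero gradient, so plain SGD leaves those rows unchanged. Deeper layers and optimizers with weight decay can still move, of course.

    import torch
    import torch.nn as nn

    # Toy embedding table standing in for a pretrained model's token embeddings.
    vocab_size, dim = 10, 4
    emb = nn.Embedding(vocab_size, dim)

    # A fine-tuning batch that never contains token id 7 (our "unknown" token).
    batch = torch.tensor([[1, 2, 3]])
    loss = emb(batch).sum()
    loss.backward()

    print(emb.weight.grad[7])  # all zeros: the unseen row gets no gradient
    print(emb.weight.grad[2])  # nonzero: rows for seen tokens do get updated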

xihajun

It seems there is a bug in fairseq (see a GitHub issue) where the --replace-unk option fails if the source-side token is also <unk>.

If you train your models from scratch, a workaround might be to use a tokenizer that never produces <unk>s, such as SentencePiece with byte fallback, which splits unknown tokens down to the byte level when necessary, and all bytes are in the vocabulary.
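For illustration, here is a small sketch of how such a tokenizer could be trained with the sentencepiece Python package (the file names and vocabulary size are placeholders); the resulting subword text can then be binarized with fairseq-preprocess as usual.

    import sentencepiece as spm

    # Train a BPE model with byte fallback: strings the learned vocabulary
    # cannot cover are decomposed into byte pieces, so <unk> is never produced.
    spm.SentencePieceTrainer.train(
        input="train.txt",          # placeholder path to raw training text
        model_prefix="spm_bpe",
        vocab_size=8000,            # placeholder size
        model_type="bpe",
        byte_fallback=True,
    )

    sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
    # Unknown strings are segmented into byte pieces instead of <unk>.
    print(sp.encode("MKIK 牛逼", out_type=str))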

Jindřich