
I am new to NLP and the Transformers library. Perhaps my question is naive, but I cannot find a good solution for it.

I have documents whose content is sensitive, and it is a requirement of mine not to publish it in the clear on the cloud. However, my model runs on a cloud virtual machine.

My idea would be to perform OCR and tokenization on premise and then upload the results.

However, tokenization with a PreTrainedTokenizer from the Transformers library returns token ids from its vocabulary, and anyone with the same pretrained model can decode them.

So here is the question: is it possible to fine-tune, or simply change, the vocabulary index so that the tokenization can't be easily decoded?
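One way this could be sketched (this is my own illustration, not a Transformers feature): apply a secret, keyed permutation to the token ids on premise before uploading, and keep the key local. The `VOCAB_SIZE`, `SECRET_KEY`, and the example ids below are assumptions for illustration. Note that if the cloud model is to reuse pretrained weights, its embedding matrix rows would have to be permuted with the same key.

```python
import random

# Illustrative sketch (assumptions): permute token ids with a secret seed
# so the uploaded ids no longer match the public pretrained vocabulary.
VOCAB_SIZE = 30522   # assumed vocabulary size (e.g. bert-base-uncased)
SECRET_KEY = 42      # secret seed; keep this on premise, never upload it

rng = random.Random(SECRET_KEY)
perm = list(range(VOCAB_SIZE))
rng.shuffle(perm)                      # perm[old_id] -> new_id
inv = [0] * VOCAB_SIZE
for old_id, new_id in enumerate(perm):
    inv[new_id] = old_id               # inverse mapping for local decoding

def scramble(ids):
    """Map on-premise tokenizer ids to scrambled ids for upload."""
    return [perm[i] for i in ids]

def unscramble(ids):
    """Recover the original ids locally using the secret key."""
    return [inv[i] for i in ids]

token_ids = [101, 7592, 2088, 102]     # hypothetical ids from a local tokenizer
uploaded = scramble(token_ids)
assert unscramble(uploaded) == token_ids
```

Be aware that a fixed permutation of the id space is essentially a substitution cipher, so it only obscures the text; an attacker with enough scrambled data could attempt frequency analysis on common tokens. It raises the bar, but it is not encryption.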

xxfeffo