Given a dictionary char_to_idx, how can one create a tokenizer such that the ids of the tokens are guaranteed to be the same as in char_to_idx?

import tokenizers

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
# ???
print(tokenizer.get_vocab())
# desired output: {'a': 0, 'b': 1, 'c': 2, 'd': 3}
Yorai Levi

1 Answer

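Here is a simple way to do it in plain Python:

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

# Build the inverse mapping (id -> character) from the ids stored in char_to_idx.
# enumerate(char_to_idx) would only be correct if the dict happened to be
# ordered by id, so invert the key/value pairs instead.
itos = {i: ch for ch, i in char_to_idx.items()}

After this, decoding looks each id up in itos, so every id maps back to exactly the character it has in char_to_idx:

decode = lambda l: ''.join(itos[i] for i in l)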
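To make the round trip concrete, here is a minimal sketch of a character-level encoder and decoder built directly on char_to_idx. It is plain Python and does not use the tokenizers library. The names encode and idx_to_char are mine, not from the question, and I am assuming every input character is present in char_to_idx (an unknown character raises KeyError).

```python
char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

# Invert by the stored ids, so the result does not depend on dict ordering.
idx_to_char = {idx: ch for ch, idx in char_to_idx.items()}


def encode(text):
    """Map each character to its id in char_to_idx (KeyError if unknown)."""
    return [char_to_idx[ch] for ch in text]


def decode(ids):
    """Map each id back to its character and join the result into a string."""
    return ''.join(idx_to_char[i] for i in ids)


ids = encode("dcba")
print(ids)
print(decode(ids))
```

Because both directions read the ids straight from char_to_idx, the ids cannot drift from the ones in the dictionary, whatever order its keys were inserted in.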
Harshad Patil