How to create huggingface tokenizer from a "char_to_idx" dict?

Question

Given a dictionary char_to_idx, how can one create a tokenizer whose token ids are guaranteed to match those in char_to_idx?

import tokenizers

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
# ???
print(tokenizer.get_vocab())
# {'a': 0, 'b': 1, 'c': 2, 'd': 3}

score 0 · Answer 1 · answered Jun 16 '23 at 17:55


I have a simple way to do it:

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

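# Invert char_to_idx into an index-to-character map (idx_to_char).
# Iterating over items() keeps each character paired with its own id,
# instead of relying on the dict's insertion order.
itos = {i: ch for ch, i in char_to_idx.items()}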

After this:

# Map a list of ids back to the string they encode
decode = lambda l: ''.join(itos[i] for i in l)
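
For completeness, here is a minimal pure-Python sketch of the full round trip, assuming every id in char_to_idx is unique. The encode helper is not part of the answer above; it is a hypothetical counterpart to decode added for illustration. Note that this does not build a tokenizers.Tokenizer object, it only reproduces the id mapping.

```python
char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

# Invert the mapping by id, so the result does not depend on dict order
itos = {i: ch for ch, i in char_to_idx.items()}

# encode is a hypothetical helper (assumption, not in the original answer):
# it looks up the id of each character in the input string
encode = lambda s: [char_to_idx[ch] for ch in s]

# decode maps a list of ids back to the string they encode
decode = lambda l: ''.join(itos[i] for i in l)

ids = encode('dacb')
print(ids)          # [3, 0, 2, 1]
print(decode(ids))  # dacb
```

A character missing from char_to_idx raises KeyError here; whether to map it to an unknown token instead is a design choice this sketch leaves open.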


Harshad Patil


Yes but that's not what I was asking for, thank you for an alternative solution. – Yorai Levi Jun 16 '23 at 17:59
