
Is there a way to get the mapping from the tokens back to the original words when using the tokenizer.decode() function?
For example:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

str = "This is a tokenization example"
tokenized = tokenizer.tokenize(str) 
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(str) 
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids']) 
## '<s> this is a tokenization example</s>'

The objective is to have a function that maps each token in the decoded sequence back to the correct input word. For this example it would be:
desired_output = [[1], [2], [3], [4, 5], [6]]
since this corresponds to id 42, while token and ization correspond to ids [19233, 1938], which sit at indices 4 and 5 of the input_ids array.


2 Answers


As far as I know, there is no built-in method for that, but you can create one yourself:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})

Output:

{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}

To get exactly your desired output, you can combine a list comprehension with a short loop:

# start index, because the number of special tokens prepended is fixed for each
# model (but be aware of single-sentence vs. pairwise-sentence input)
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)

Output:

[[1], [2], [3], [4, 5], [6]]
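
If you prefer not to hardcode the start index, a rough sketch for deriving it (assuming a single-sentence input whose first content id does not also occur among the leading special tokens) is to compare the encoding with and without special tokens:

# Sketch: count the special tokens prepended to a single-sentence input.
with_special = tokenizer.encode(example)                            # e.g. [0, 42, ..., 2]
without_special = tokenizer.encode(example, add_special_tokens=False)

# Position of the first real content token gives the start index (1 for RoBERTa's <s>).
idx = with_special.index(without_special[0])
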
cronoik
  • Thanks. I need the indexes of the relevant encoded tokens, after adding the special tokens, as later on I'm averaging the feature outputs of the model on those idxs (you can see that our example outputs are different). – DsCpp Jun 13 '20 at 05:38
  • Nevertheless, if no such default function exists, it's pretty straightforward; I'll write it and post it here, thanks. – DsCpp Jun 13 '20 at 05:39
  • It's a workaround, as it ignores other special tokens, but the general idea is clear, thanks :) – DsCpp Jun 13 '20 at 12:45
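
A minimal sketch of the averaging described in the comment above (my own illustration, not part of the original answer), assuming a RoBERTa model's last_hidden_state and the desired_output index lists computed earlier:

import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-large')

inputs = tokenizer(example, return_tensors='pt')
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # shape: (sequence_length, hidden_size)

# Average the sub-word feature vectors belonging to each original word.
word_features = [hidden[idxs].mean(dim=0) for idxs in desired_output]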

If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage, i.e. split by whitespace, while a subword is generated by the actual model (BPE or Unigram, for example).

The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase; the words here are Pascal and Case. The accepted answer won't work in this case, since it assumes words are whitespace-delimited (a rough sketch of such a splitting step is shown after the code below).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

encoded = tokenizer(example)

desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:
        # word_to_tokens returns the token span (start, end) covering this word
        start, end = encoded.word_to_tokens(word_id)
        if start == end - 1:
            tokens = [start]
        else:
            tokens = [start, end - 1]
        # word_ids() yields one entry per token, so skip repeated spans
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
desired_output
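
For reference, a pre-tokenization step that splits on PascalCase (like the custom step mentioned above) could be sketched with the tokenizers library roughly as follows; the regex and behaviour here are illustrative assumptions, not the answerer's actual code:

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Treat each capitalised chunk as its own "word" during pre-tokenization.
pascal_split = Split(Regex(r"[A-Z][a-z]+"), behavior="isolated")
print(pascal_split.pre_tokenize_str("PascalCase"))
# expected: 'Pascal' and 'Case' as separate pre-tokens with their offsets
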
David Waterworth