
How can I map the tokens I get from huggingface DistilBertTokenizer to the positions of the input text?

e.g. I have a new GPU -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]

I'm interested in this because, supposing I have attention values assigned to each token, I would like to show which part of the original text each token actually corresponds to, since the tokenized version is not friendly to non-ML people.

I have not found any solution to this yet, especially for the case where the tokenizer produces an [UNK] token. Any insights would be appreciated. Thank you!

Hardian Lawi
  • Are you talking about character mappings? Do you necessarily have to show each individual (subword) token, or would it be an alternative to average over subwords and merge them back together (e.g., averaging over `gp` and `##u`)? – dennlinger Nov 25 '21 at 10:05

1 Answer


In newer versions of Transformers (since 2.8, it seems), calling the tokenizer returns an object of class BatchEncoding when the methods __call__, encode_plus, and batch_encode_plus are used. You can then use the method token_to_chars, which takes the index of a token in the batch and returns its character span in the original string.

Jindřich
    Just a note, [`encode_plus` is now deprecated](https://huggingface.co/docs/transformers/v4.16.2/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode_plus), and `__call__` should be used instead (that is rather than doing `tokenizer.encode_plus("My input string")`, do `tokenizer("My input string")`; it returns a BatchEncoding object which has the method `token_to_chars`). – postylem Feb 08 '22 at 01:45