How can I map the tokens I get from the Hugging Face DistilBertTokenizer
back to their character positions in the input text?
For example: I have a new GPU
-> ["i", "have", "a", "new", "gp", "##u"]
-> [(0, 1), (2, 6), ...]
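To make the goal concrete, here is a naive sketch of the kind of alignment I mean. It is not taken from the library: naive_offsets is a hypothetical helper, and it assumes an uncased vocabulary, no accent stripping, lowercasing that does not change string length, and "##" as the continuation prefix. It gives up (returns None) for any token whose text cannot be found in the input, which is what happens with [UNK].

```python
def naive_offsets(text, tokens):
    """Greedily align WordPiece tokens to (start, end) character spans in text.

    Hypothetical helper for illustration; end offsets are exclusive.
    """
    lowered = text.lower()  # assumption: uncased tokenizer
    spans = []
    cursor = 0  # search only to the right of the previous match
    for token in tokens:
        # Drop the "##" continuation prefix to get the surface string.
        piece = token[2:] if token.startswith("##") else token
        start = lowered.find(piece, cursor)
        if start == -1:
            # Surface form is lost (e.g. [UNK]); no span can be recovered here.
            spans.append(None)
            continue
        end = start + len(piece)
        spans.append((start, end))
        cursor = end
    return spans


text = "I have a new GPU"
tokens = ["i", "have", "a", "new", "gp", "##u"]
print(naive_offsets(text, tokens))
# [(0, 1), (2, 6), (7, 8), (9, 12), (13, 15), (15, 16)]
```

This works for the example above, but it breaks down as soon as a token no longer matches the original characters.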
I'm interested in this because, given attention values assigned to each token, I would like to show which part of the original text each token actually corresponds to, since the tokenized version is not friendly to people outside ML.
I have not found a solution to this yet, especially for inputs that produce an [UNK]
token. Any insights would be appreciated. Thank you!