
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then transform them back into the original text. In this case that would be 'Natural Science Museum of Madrid shows the REC'.

Is there a way to do this?

JayJay

2 Answers


In addition to the information provided by Jindřich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
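
To avoid hard-coding the slice tokens[1:10], you can look up the position of the target token first. A minimal sketch (assuming the target word survives as a single token, as 'Madrid' does here, and reusing tz and tokens from the question):

n = 4
i = tokens.index("Madrid")  # position of the target token
window = tokens[max(i - n, 0):i + n + 1]  # n tokens per side, clipped at the sentence boundaries
print(tz.convert_tokens_to_string(window))
# 'Natural Science Museum of Madrid shows the REC'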
cronoik
  • But can I be sure that when I tokenize the untokenized sentence (i.e., 'Natural Science Museum of Madrid shows the REC'), the resulting tokens are the same as the original ones? – JayJay Feb 21 '21 at 23:12
  • What do you mean by "original ones"? You cannot be sure that you can reconstruct the string, as Jindřich has explained in his answer. Another example is unknown tokens `[UNK]`. For example, the following leads to an `[UNK]`: `tz.tokenize("The Natural Science Museum of Madrid ")`, which means you lose information during tokenization. @JayJay – cronoik Feb 22 '21 at 01:32

BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenization is fully reversible.
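
A quick illustration of that loss, reusing the tokenizer tz from the question (the emoji is an arbitrary example of a character missing from the bert-base-cased vocabulary):

toks = tz.tokenize("Madrid 🦖")
print(toks)  # ['Madrid', '[UNK]']
print(tz.convert_tokens_to_string(toks))
# 'Madrid [UNK]' – the original character is irrecoverably gone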

You can get the so-called pre-tokenized text by merging the tokens that start with ##:

pretok_sent = ""
for tok in tokens:
     if tok.startswith("##"):
         pretok_sent += tok[2:]
     else:
         pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
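
Running this on the token list from the question reconstructs the original sentence:

print(pretok_sent)
# 'The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur'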

Note, however, that if the sentence contains punctuation, the punctuation remains separated from the other tokens; that is exactly what pre-tokenization means. Such a sentence can look like this:

'This is a sentence ( with brackets ) .'

Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses.

import sacremoses

detok = sacremoses.MosesDetokenizer('en')
sent = 'This is a sentence ( with brackets ) .'
detok.detokenize(sent.split(" "))  # the detokenize method applies the rules

This results in:

'This is a sentence (with brackets).'
Jindřich
  • This isn't exactly true, because if you use a Pipeline it returns the original `start` and `end` indices into the string along with the prediction. This means you can recreate the text, but you could not disambiguate the whitespace character type. – Matt Jun 10 '22 at 11:52
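
In the same spirit, the fast tokenizers expose character offsets directly, which sidesteps detokenization entirely. A sketch of that approach (an assumption: return_offsets_mapping requires a fast tokenizer such as BertTokenizerFast):

from transformers import BertTokenizerFast

tz = BertTokenizerFast.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"
enc = tz(sentence, return_offsets_mapping=True, add_special_tokens=False)

n = 4
i = enc.tokens().index("Madrid")
start = enc["offset_mapping"][max(i - n, 0)][0]  # char start of the leftmost token
end = enc["offset_mapping"][min(i + n, len(enc.tokens()) - 1)][1]  # char end of the rightmost token
print(sentence[start:end])
# 'Natural Science Museum of Madrid shows the REC'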