
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.

from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"

tokens = tz.tokenize(sentence)
print(tokens)

>>['The', 'Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC', '##ON', '##ST', '##R', '##UC', '##TI', '##ON', 'of', 'a', 'dinosaur']

What I want is to get the text corresponding to the 4 tokens to the left and to the right of the token Madrid. So I want the tokens ['Natural', 'Science', 'Museum', 'of', 'Madrid', 'shows', 'the', 'R', '##EC'] and then transform them back into the original text. In this case that would be 'Natural Science Museum of Madrid shows the REC'.

Is there a way to do this?

JayJay

2 Answers


In addition to the information provided by Jindřich about the information loss, I want to add that Hugging Face provides a built-in method to convert tokens to a string (the lost information remains lost!). The method is called convert_tokens_to_string:

tz.convert_tokens_to_string(tokens[1:10])

Output:

'Natural Science Museum of Madrid shows the REC'
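
To avoid hard-coding the slice tokens[1:10], you can look up the position of the target token first. A minimal sketch (assuming the target word survives as a single token, as 'Madrid' does here, and reusing tz and tokens from the question):

n = 4
i = tokens.index("Madrid")  # position of the target token
window = tokens[max(i - n, 0):i + n + 1]  # n tokens per side, clipped at the sentence boundaries
print(tz.convert_tokens_to_string(window))
# 'Natural Science Museum of Madrid shows the REC'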
cronoik
  • But can I be sure that when I tokenize the untokenized sentence (i.e., 'Natural Science Museum of Madrid shows the REC'), the resulting tokens are the same as the original ones? – JayJay Feb 21 '21 at 23:12
  • What do you mean by "original ones"? You cannot be sure that you can reconstruct the string, as Jindřich has explained in his answer. Another example is unknown tokens `[UNK]`. For example, the following leads to an `[UNK]`: `tz.tokenize("The Natural Science Museum of Madrid ")`, which means you lose information during tokenization. @JayJay – cronoik Feb 22 '21 at 01:32

BERT uses WordPiece tokenization, which is unfortunately not lossless, i.e., you are never guaranteed to get the same sentence back after detokenization. This is a big difference from RoBERTa, whose byte-level BPE tokenization is fully reversible.
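
A quick illustration of that loss, reusing the tokenizer tz from the question (the emoji is an arbitrary example of a character missing from the bert-base-cased vocabulary):

toks = tz.tokenize("Madrid 🦖")
print(toks)  # ['Madrid', '[UNK]']
print(tz.convert_tokens_to_string(toks))
# 'Madrid [UNK]' – the original character is irrecoverably gone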

You can get the so-called pre-tokenized text by merging the tokens that start with ##:

pretok_sent = ""
for tok in tokens:
     if tok.startswith("##"):
         pretok_sent += tok[2:]
     else:
         pretok_sent += " " + tok
pretok_sent = pretok_sent[1:]
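
Running this on the token list from the question reconstructs the original sentence:

print(pretok_sent)
# 'The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur'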

Note, however, that if the sentence contains punctuation, the punctuation remains separated from the other tokens; that is exactly what pre-tokenization means. Such a sentence can look like this:

'This is a sentence ( with brackets ) .'

Going from the pre-tokenized text to a standard sentence is the lossy step (you can never know if, and how many, extra spaces were in the original sentence). You can get a standard sentence by applying detokenization rules, such as those in sacremoses.

import sacremoses

detok = sacremoses.MosesDetokenizer('en')
sent = 'This is a sentence ( with brackets ) .'
detok.detokenize(sent.split(" "))  # the detokenize method applies the rules

This results in:

'This is a sentence (with brackets).'
Jindřich
  • This isn't exactly true, because if you use a Pipeline it returns the original `start` and `end` indices into the string along with the prediction. This means you can recreate the text, but you could not disambiguate the whitespace character type. – Matt Jun 10 '22 at 11:52
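
In the same spirit, the fast tokenizers expose character offsets directly, which sidesteps detokenization entirely. A sketch of that approach (an assumption: return_offsets_mapping requires a fast tokenizer such as BertTokenizerFast):

from transformers import BertTokenizerFast

tz = BertTokenizerFast.from_pretrained("bert-base-cased")
sentence = "The Natural Science Museum of Madrid shows the RECONSTRUCTION of a dinosaur"
enc = tz(sentence, return_offsets_mapping=True, add_special_tokens=False)

n = 4
i = enc.tokens().index("Madrid")
start = enc["offset_mapping"][max(i - n, 0)][0]  # char start of the leftmost token
end = enc["offset_mapping"][min(i + n, len(enc.tokens()) - 1)][1]  # char end of the rightmost token
print(sentence[start:end])
# 'Natural Science Museum of Madrid shows the REC'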