I am using the DistilBertTokenizer from HuggingFace.
I would like to tokenize my text by simply splitting it on spaces:
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
instead of the default behavior, which is like this:
["Do", "n't", "you", "love", "", "Transformers", "?", "We", "sure", "do", "."]
I read their documentation about tokenization in general as well as about the BERT tokenizer specifically, but I could not find an answer to this simple question :(
I assume it should be a parameter when loading the tokenizer, but I could not find it in the parameter list ...
EDIT: Minimal code example to reproduce:
from transformers import DistilBertTokenizer

# Load the DistilBERT tokenizer that matches the checkpoint
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)