
I am using the DistilBertTokenizer from HuggingFace.

I would like to tokenize my text by simply splitting it on spaces:

["Don't", "you", "love", "", "Transformers?", "We", "sure", "do."]

instead of the default behavior, which looks like this:

["Do", "n't", "you", "love", "", "Transformers", "?", "We", "sure", "do", "."]

I read their documentation about tokenization in general as well as about the BERT tokenizer specifically, but could not find an answer to this simple question :(

I assume that it should be a parameter when loading the tokenizer, but I could not find it in the list of parameters ...

EDIT: Minimal code example to reproduce:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

tokens = tokenizer.tokenize("Don't you love  Transformers? We sure do.")
print("Tokens: ", tokens)
Taras Kucherenko

2 Answers


That is not how it works. The transformers library provides different types of tokenizers. In the case of DistilBERT, it is a WordPiece tokenizer with a fixed vocabulary that was used to train the corresponding model, so it does not offer such modifications (as far as I know). What you can do instead is use the split() method of the Python string:

text = "Don't you love  Transformers? We sure do."
tokens = text.split()
print("Tokens: ", tokens)

Output:

Tokens:  ["Don't", 'you', 'love', '', 'Transformers?', 'We', 'sure', 'do.']

In case you are looking for a slightly more sophisticated tokenization that also takes punctuation into account, you can use the basic_tokenizer:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.basic_tokenizer.tokenize(text)
print("Tokens: ", tokens)

Output:

Tokens:  ['Don', "'", 't', 'you', 'love', '', 'Transformers', '?', 'We', 'sure', 'do', '.']
cronoik
  • That sounds great. Thank you for the tip, @cronoik. And so I could encode a sentence using BERT after I split it into tokens myself, right? – Taras Kucherenko Feb 06 '21 at 08:20
  • 1
    @TarasKucherenko: It depends. You can for example train your own BERT with whitespace tokenization or any other approach. But when you use a pre-trained BERT you have to use the same tokenization algorithm, because a pre-trained model has learned vector representations for each token and you can not simply change the tokenization approach without losing the benefit of a pre-trained model. – cronoik Feb 06 '21 at 08:45
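
To illustrate the point from the comment above, here is a minimal sketch (assuming the distilbert-base-cased checkpoint) of what happens when whitespace-split tokens are looked up directly in the pre-trained WordPiece vocabulary; anything not in that vocabulary falls back to the unknown token:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

# whitespace-split "words" such as "Transformers?" are usually not part of the
# WordPiece vocabulary, so they are mapped to the unknown-token id instead of a
# learned embedding
whitespace_tokens = "Don't you love  Transformers? We sure do.".split()
ids = tokenizer.convert_tokens_to_ids(whitespace_tokens)
print("Ids: ", ids)
print("Unknown token and id: ", tokenizer.unk_token, tokenizer.unk_token_id)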

EDIT: this does not do what I wanted as pointed out in the comments.

Here is one idea I have tried:

from transformers import DistilBertModel, DistilBertTokenizer
import torch

text_str = "also du fängst an mit der Stadtrundfahrt"

# create DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-german-cased')
model = DistilBertModel.from_pretrained('distilbert-base-german-cased')

# check if tokens are correct
tokens = tokenizer.basic_tokenizer.tokenize(text_str)
print("Tokens: ", tokens)

# Encode the current text
input_ids = torch.tensor(tokenizer.encode(tokens)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states.shape)
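# the slice [0, 1:-1] drops the [CLS] and [SEP] positions before inspecting the per-token vectors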
print(last_hidden_states[0,1:-1].shape)

print(last_hidden_states)

The idea was to first split the text into tokens using the BasicTokenizer (as proposed by @cronoik) and then pass the already tokenized text when encoding it.

Taras Kucherenko
  • Not sure if you are aware of it, but I think this is not what you want to do. Please check the output of `tokenizer.convert_ids_to_tokens(input_ids.tolist()[0])`. What you will see is that `fängst` and `Stadtrundfahrt` are encoded with the same id, because they are not part of the dictionary (a sketch of this check follows these comments). @TarasKucherenko – cronoik Feb 08 '21 at 11:38
  • Oh yes, you are absolutely right, @cronoik. So it seems that what I want is indeed not possible yet. I opened an issue for it: https://github.com/huggingface/transformers/issues/10058 – Taras Kucherenko Feb 08 '21 at 16:45
  • 1
    It is not like that they need to modify their code to make that possible. What you are requesting is a full dictionary transformer. You can use each of their transformers models, but you need to train them by yourself. There is no pre-trained model available as far as I know. According to this [website](https://www.welt.de/kultur/article124064744/Die-deutsche-Sprache-hat-5-3-Millionen-Woerter.html), it would require 5.3 million entries. There are very good reasons (e.g. training data, training time) why subword tokenization is the current state of the art approach. – cronoik Feb 12 '21 at 07:35
  • Right, good point @cronoik. I edited this proposed answer to reflect it. – Taras Kucherenko Feb 12 '21 at 08:15
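
A minimal sketch of the check suggested in the comments above (assuming the same distilbert-base-german-cased checkpoint); mapping the ids back to tokens shows which of the pre-split words are missing from the vocabulary and collapse to the unknown token:

from transformers import DistilBertTokenizer
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-german-cased')

# pre-split the text with the BasicTokenizer, as in the answer above
tokens = tokenizer.basic_tokenizer.tokenize("also du fängst an mit der Stadtrundfahrt")

# encoding a list of tokens looks each one up in the vocabulary directly
input_ids = torch.tensor(tokenizer.encode(tokens)).unsqueeze(0)

# map the ids back to tokens; out-of-vocabulary words show up as the unknown token
print(tokenizer.convert_ids_to_tokens(input_ids.tolist()[0]))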