
I can use tiktoken's cl100k_base tokenizer to encode text data:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary("hello world")  # encodes without special tokens
print(ids)

which prints the token IDs:

[15339, 1917]
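
As a sanity check, the IDs decode back to the original string:

print(enc.decode([15339, 1917]))  # -> 'hello world'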

Meanwhile, in Hugging Face I use bert-base-uncased as the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

source_lang = "en"
target_lang = "fr"
prefix = "Translate English to French: "

def preprocess_dataset(examples):
    # Prepend the task prefix to each source sentence
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    # Tokenize sources and targets together; the targets become "labels"
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

tokenized = my_dataset.map(preprocess_dataset, batched=True)
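
For context, my_dataset is a datasets.Dataset whose "translation" column maps language codes to strings. A minimal stand-in with the same schema (illustrative only, not my real data) would be:

from datasets import Dataset

# Toy stand-in for my_dataset; the real dataset has the same schema
my_dataset = Dataset.from_dict({
    "translation": [
        {"en": "hello world", "fr": "bonjour le monde"},
        {"en": "good morning", "fr": "bonjour"},
    ]
})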

My question is: how can I use tiktoken's cl100k_base encoding in place of the BERT tokenizer in the Hugging Face pipeline above?
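
For what it's worth, the closest I can imagine is a hand-rolled wrapper that mimics the tokenizer's __call__ interface. This is purely a sketch of my own (the class name and behavior are invented, and it handles no padding, special tokens, or other PreTrainedTokenizer conventions), so I am not sure it is the right approach:

import tiktoken

class TiktokenWrapper:
    """Hypothetical sketch: a Hugging Face-style __call__ on top of
    tiktoken's cl100k_base. Not a drop-in replacement for a real
    PreTrainedTokenizer."""

    def __init__(self, encoding_name="cl100k_base"):
        self.enc = tiktoken.get_encoding(encoding_name)
        self.vocab_size = self.enc.n_vocab  # about 100k for cl100k_base

    def __call__(self, texts, text_target=None, max_length=None, truncation=False):
        if isinstance(texts, str):
            texts = [texts]
        # Encode each text without special tokens
        input_ids = [self.enc.encode_ordinary(t) for t in texts]
        if truncation and max_length is not None:
            input_ids = [ids[:max_length] for ids in input_ids]
        out = {
            "input_ids": input_ids,
            "attention_mask": [[1] * len(ids) for ids in input_ids],
        }
        if text_target is not None:
            if isinstance(text_target, str):
                text_target = [text_target]
            labels = [self.enc.encode_ordinary(t) for t in text_target]
            if truncation and max_length is not None:
                labels = [ids[:max_length] for ids in labels]
            out["labels"] = labels
        return out

In principle tokenizer = TiktokenWrapper() could replace the tokenizer in preprocess_dataset above, but I doubt the resulting IDs are meaningful to a model trained on BERT's vocabulary.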

From a technical perspective -> yes, but from a performance perspective -> no. You cannot replace a tokenizer with a certain vocabulary with a tokenizer that has a different vocabulary without changing/training the model as well. – cronoik Apr 19 '23 at 11:33

0 Answers