I can use tiktoken's cl100k_base
tokenizer to encode text data:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary('hello world')
print(ids)
which prints the token IDs:
[15339, 1917]
In Hugging Face, on the other hand, I use bert-base-uncased
as the tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def preprocess_dataset(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
source_lang = "en"
target_lang = "fr"
prefix = "Translate English to French: "
tokenized = my_dataset.map(preprocess_dataset, batched=True)
My question is: how can I use tiktoken's cl100k_base
to replace the BERT tokenizer in the Hugging Face pipeline above?