
I can use tiktoken's cl100k_base tokenizer to encode text data:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary("hello world")  # encodes without special tokens
print(ids)

which prints the token IDs:

[15339, 1917]
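
As a sanity check, the IDs decode back to the original string:

print(enc.decode([15339, 1917]))  # -> 'hello world'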

Meanwhile, in Hugging Face I use bert-base-uncased as the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

source_lang = "en"
target_lang = "fr"
prefix = "Translate English to French: "

def preprocess_dataset(examples):
    # Prepend the task prefix to each source sentence
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    # Tokenize sources and targets together; the targets become "labels"
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

tokenized = my_dataset.map(preprocess_dataset, batched=True)
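
For context, my_dataset is a datasets.Dataset whose "translation" column maps language codes to strings. A minimal stand-in with the same schema (illustrative only, not my real data) would be:

from datasets import Dataset

# Toy stand-in for my_dataset; the real dataset has the same schema
my_dataset = Dataset.from_dict({
    "translation": [
        {"en": "hello world", "fr": "bonjour le monde"},
        {"en": "good morning", "fr": "bonjour"},
    ]
})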

My question is: how can I use tiktoken's cl100k_base encoding in place of the BERT tokenizer in the Hugging Face pipeline above?
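
For what it's worth, the closest I can imagine is a hand-rolled wrapper that mimics the tokenizer's __call__ interface. This is purely a sketch of my own (the class name and behavior are invented, and it handles no padding, special tokens, or other PreTrainedTokenizer conventions), so I am not sure it is the right approach:

import tiktoken

class TiktokenWrapper:
    """Hypothetical sketch: a Hugging Face-style __call__ on top of
    tiktoken's cl100k_base. Not a drop-in replacement for a real
    PreTrainedTokenizer."""

    def __init__(self, encoding_name="cl100k_base"):
        self.enc = tiktoken.get_encoding(encoding_name)
        self.vocab_size = self.enc.n_vocab  # about 100k for cl100k_base

    def __call__(self, texts, text_target=None, max_length=None, truncation=False):
        if isinstance(texts, str):
            texts = [texts]
        # Encode each text without special tokens
        input_ids = [self.enc.encode_ordinary(t) for t in texts]
        if truncation and max_length is not None:
            input_ids = [ids[:max_length] for ids in input_ids]
        out = {
            "input_ids": input_ids,
            "attention_mask": [[1] * len(ids) for ids in input_ids],
        }
        if text_target is not None:
            if isinstance(text_target, str):
                text_target = [text_target]
            labels = [self.enc.encode_ordinary(t) for t in text_target]
            if truncation and max_length is not None:
                labels = [ids[:max_length] for ids in labels]
            out["labels"] = labels
        return out

In principle tokenizer = TiktokenWrapper() could replace the tokenizer in preprocess_dataset above, but I doubt the resulting IDs are meaningful to a model trained on BERT's vocabulary.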

From a technical perspective -> yes, but from a performance perspective -> no. You cannot replace a tokenizer with a certain vocabulary with a tokenizer that has a different vocabulary without changing/training the model as well. – cronoik Apr 19 '23 at 11:33

0 Answers