Questions tagged [huggingface-tokenizers]
451 questions
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
1
vote
1 answer
How to add new tokens to an existing Huggingface tokenizer?
How to add new tokens to an existing Huggingface AutoTokenizer?
Canonically, there's this tutorial from Huggingface https://huggingface.co/learn/nlp-course/chapter6/2 but it ends on the note of "quirks when using existing tokenizers". And then it…

alvas
- 115,346
- 109
- 446
- 738
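For questions like the one above, the core pattern is small: append tokens to the vocabulary, then (if a transformers model is involved) resize its embeddings. A minimal sketch using the tokenizers library directly, with a toy made-up vocabulary standing in for a real pretrained tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy vocabulary for illustration; a real tokenizer would be loaded or trained.
tokenizer = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# New tokens are appended after the existing vocabulary.
num_added = tokenizer.add_tokens(["newtok"])
ids = tokenizer.encode("hello newtok").ids
print(num_added, ids)  # 1 [0, 3]
```

When the tokenizer backs a transformers model, the matching step is `model.resize_token_embeddings(len(tokenizer))`, so the embedding matrix grows to cover the new ids.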
1
vote
0 answers
Error when loading a BERT model using load_model after adding a new token to the tokenizer
I am getting this error when trying to load the saved .h5 model in tensorflow:
load_model("path_to_model.h5", custom_objects={"TFBertModel": TFBertModel, "AdamWeightDecay": AdamWeightDecay})
ValueError: Cannot assign value to variable
'…

richard
- 11
- 1
1
vote
0 answers
The tokenizer doesn't recognize the new special tokens
When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I want to tokenize Arabic text.
from tokenizers import BertWordPieceTokenizer
from transformers import…

FQ912
- 11
- 2
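A relevant detail for the question above: tokens registered via `add_special_tokens` are matched on the raw text before pre-tokenization, so bracketed names like `[SP]` are not split apart by the pre-tokenizer. A sketch with a toy WordLevel vocabulary standing in for the question's BertWordPieceTokenizer (which would need a vocab file):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(WordLevel({"[UNK]": 0, "hi": 1}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Special tokens are matched before pre-tokenization and normalization,
# so "[SP]" survives as one token instead of being split into "[", "SP", "]".
tok.add_special_tokens(["[SP]", "[EMPTY]"])
enc = tok.encode("hi [SP]")
print(enc.tokens)  # ['hi', '[SP]']
```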
1
vote
0 answers
DataFrame text tokenization with Hugging Face is not working
I have a DataFrame with text I want to tokenize using the Hugging Face library. When running the code, the "Tokenized Text" column returns empty. How can this be solved? The code is as follows:
df = pd.read_csv('subject_messages.csv')
import…

Mark Davidson
- 23
- 3
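For the DataFrame question above, an empty result column usually means the tokenizer output was never assigned back, or the wrong column was tokenized. A minimal working pattern, with a toy tokenizer and assumed column names standing in for the question's CSV:

```python
import pandas as pd
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy tokenizer standing in for the real one from the question.
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

df = pd.DataFrame({"text": ["hello world", "hello"]})
# Apply per row and assign the result to the new column explicitly.
df["Tokenized Text"] = df["text"].apply(lambda s: tok.encode(s).tokens)
print(df["Tokenized Text"].tolist())  # [['hello', 'world'], ['hello']]
```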
1
vote
1 answer
GPU out of memory when fine-tuning flan-ul2
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 15.78 GiB total capacity; 14.99 GiB already allocated; 3.50
MiB free; 14.99 GiB reserved in total by PyTorch) If reserved memory
is >> allocated memory try setting…

Salty Gold Fish
- 431
- 5
- 14
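The error text itself suggests the first knob to try: PyTorch's caching-allocator setting. A hedged starting point (the value is illustrative, not a recommendation) before moving on to smaller batch sizes, gradient accumulation, gradient checkpointing, or loading the model in reduced precision:

```shell
# Reduce allocator fragmentation; tune the value for your GPU.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```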
1
vote
0 answers
Is it possible to use Tiktoken's cl100k_base Tokenizer in HuggingFace's pipeline?
I can use Tiktoken's cl100k_base Tokenizer to encode text data.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary('hello world')
print(ids)
which outputs the token IDs:
[15339, 1917]
While in HuggingFace, I use…

Raptor
- 53,206
- 45
- 230
- 366
1
vote
1 answer
Unable to import transformers.models.bert.modeling_tf_bert on macOS?
As the title suggests, I'm not able to import the BertTokenizer and TFBertModel classes from the transformers package using the following code:
from transformers import BertTokenizer, TFBertModel
tokenizer =…

talha06
- 6,206
- 21
- 92
- 147
1
vote
1 answer
Avoiding Trimmed Summaries of a PEGASUS-Pubmed huggingface summarization model
I am new to huggingface.
I am using the PEGASUS-Pubmed huggingface model to generate a summary of a research paper. Following is the code for the same. The model gives a trimmed summary.
Any way of avoiding the trimmed summaries and getting more…

Simran
- 15
- 3
1
vote
1 answer
How to interpret the model_max_len attribute of the PreTrainedTokenizer object in Huggingface Transformers
I've been trying to check the maximum length allowed by emilyalsentzer/Bio_ClinicalBERT, and after these lines of code:
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer
I've obtained the…

ignacioct
- 325
- 1
- 12
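Context for the question above: when a checkpoint's tokenizer config does not specify a maximum length, transformers reports a sentinel value (`VERY_LARGE_INTEGER`, i.e. `int(1e30)`) instead of the real limit. A sketch of how one might interpret the attribute; the 512 fallback is an assumption that holds for BERT-style position embeddings, not a universal rule:

```python
VERY_LARGE_INTEGER = int(1e30)  # transformers' "no limit configured" sentinel

def effective_max_len(model_max_length: int, position_embeddings: int = 512) -> int:
    # If the tokenizer reports the sentinel, the config didn't set a max;
    # fall back to the model's position-embedding size instead.
    if model_max_length >= VERY_LARGE_INTEGER:
        return position_embeddings
    return model_max_length

print(effective_max_len(int(1e30)))  # 512
print(effective_max_len(512))        # 512
```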
1
vote
0 answers
Conflicting versions of adapter-transformers with versions of tokenizers
I am trying to install adapter-transformers==3.1.0, and it gives the following error. How can I find the compatible adapter-transformers version for tokenizers==0.9.2?
ERROR: pip's dependency resolver does not currently take into account all the…

Indunil Udayangana
- 41
- 1
- 3
1
vote
1 answer
How does one create a custom hugging face model that is compatible with the HF trainer?
I want to create a new hugging face (HF) architecture with some existing tokenizer (any one that is excellent is fine). Let's say a decoder, to make it concrete (but both would be better).
How does one do this? I found this…

Charlie Parker
- 5,884
- 57
- 198
- 323
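The key requirement for Trainer compatibility is that the model subclass `PreTrainedModel` and return a `loss` alongside its logits when labels are passed. A minimal sketch with an invented tiny architecture (all names here are hypothetical):

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel

class TinyConfig(PretrainedConfig):
    model_type = "tiny-lm"  # hypothetical architecture name
    def __init__(self, vocab_size=100, hidden_size=16, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

class TinyModel(PreTrainedModel):
    config_class = TinyConfig
    def __init__(self, config):
        super().__init__(config)
        self.embed = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            # Trainer reads the "loss" key from this dict during training.
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return {"loss": loss, "logits": logits}

model = TinyModel(TinyConfig())
out = model(torch.tensor([[1, 2, 3]]), labels=torch.tensor([[1, 2, 3]]))
print(out["logits"].shape)  # torch.Size([1, 3, 100])
```

Any existing tokenizer can feed `input_ids` to such a model, provided `vocab_size` matches the tokenizer's vocabulary.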
1
vote
1 answer
Train Tokenizer with HuggingFace dataset
I'm trying to train the Tokenizer with the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train the Tokenizer with the following code:
from tokenizers import Tokenizer
from tokenizers.models import…

Raptor
- 53,206
- 45
- 230
- 366
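The bridge between a HuggingFace dataset and tokenizer training is `train_from_iterator`, which accepts any iterator of strings. A sketch with an in-memory stand-in corpus; with a real dataset you would iterate over its text column (the column name depends on the dataset):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)

# Stand-in corpus; a dataset would be consumed the same way, one string at a time.
corpus = ["hello world", "hello tokenizers", "world of tokenizers"]
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.get_vocab_size())
```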
1
vote
3 answers
tokenizer.push_to_hub(repo_name) is not working
I'm trying to push my tokenizer to my huggingface repo...
It consists of the model's vocab.json (I'm making a speech recognition model)
My code:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] =…

FOXASDF
- 43
- 3
1
vote
1 answer
Huggingface token classification pipeline giving different outputs than just calling model() directly
I am trying to mask named entities in text using a RoBERTa-based model.
The suggested way to use the model is via the Huggingface pipeline, but I find that it is rather slow to use it that way. Using a pipeline on text data also prevents me from using…

Bunnyrabbit
- 21
- 3
1
vote
0 answers
mT5 transformer: how to access the encoder to compute cosine similarity
This is my method. My question is: how do I access the encoder, sending 2 sentences each time? I have a dataset that contains pairs of sentences, and I need to compute the similarity between each pair.
Could anyone help?
model =…

Maria
- 11
- 2
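The usual recipe for the question above is: run the encoder on each sentence, mean-pool `last_hidden_state` with the attention mask, then take the cosine similarity of the pooled vectors. A sketch with random tensors standing in for the encoder output (with mT5 the hidden states would come from `model.encoder(input_ids, attention_mask=mask).last_hidden_state`):

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    # hidden: (batch, seq_len, dim); mask: (batch, seq_len) of 0/1
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Stand-in for encoder output on a pair of sentences.
hidden = torch.randn(2, 5, 8)
mask = torch.ones(2, 5)

emb = mean_pool(hidden, mask)                     # one vector per sentence
sim = F.cosine_similarity(emb[0], emb[1], dim=0)  # similarity of the pair
print(float(sim))
```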