Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
Questions tagged [huggingface-tokenizers]
451 questions
2
votes
0 answers
Your fast tokenizer does not have the necessary information to save the vocabulary for a slow tokenizer
I'm trying to fine-tune a T5 model for paraphrasing Farsi sentences. I'm using this model as my base. My dataset is a paired-sentence dataset in which each row is a pair of paraphrased sentences. I want to fine-tune the model on this dataset. The…

Ali Ghasemi
- 61
- 2
2
votes
1 answer
KeyError: 'eval_loss' in Huggingface Trainer
I am trying to build a question-answering pipeline with the Huggingface framework but am facing the KeyError: 'eval_loss' error. My goal is to train, save the best model at the end, and evaluate the validation set with the loaded model. My trainer…
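A stdlib-only sketch of where this error tends to come from, assuming (as the Trainer docs describe) that the metric named by metric_for_best_model defaults to "loss" and is looked up as "eval_loss" in the evaluation metrics dict; if evaluation never produces that key (e.g. no eval dataset or no labels), the lookup raises KeyError. The function below is an illustrative stand-in, not the real Trainer code:

```python
# Hypothetical model of Trainer's best-metric lookup. The real fix is
# usually to ensure evaluation runs with labels so "eval_loss" exists,
# or to set metric_for_best_model to a metric you actually compute.
def best_metric(metrics: dict, metric_for_best_model: str = "loss") -> float:
    key = metric_for_best_model
    if not key.startswith("eval_"):
        key = "eval_" + key  # Trainer prefixes eval metrics with "eval_"
    if key not in metrics:
        raise KeyError(key)  # reproduces the reported KeyError: 'eval_loss'
    return metrics[key]
```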

Aaditya Ura
- 12,007
- 7
- 50
- 88
2
votes
1 answer
How to know if HuggingFace's pipeline text input exceeds 512 tokens
I've finetuned a Huggingface BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this:
from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder,…
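A stdlib-only sketch of one way to check this before calling the pipeline: encode the text yourself and compare the token count against the model limit. The `encode()` call mirrors the Hugging Face tokenizer interface (e.g. `ner_pipeline.tokenizer`), but the names here are illustrative:

```python
# Hedged sketch: count tokens first, assuming a tokenizer object whose
# encode() returns the token ids including special tokens ([CLS]/[SEP]).
def exceeds_model_limit(tokenizer, text: str, max_length: int = 512) -> bool:
    ids = tokenizer.encode(text)
    return len(ids) > max_length
```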

ClaudiaR
- 3,108
- 2
- 13
- 27
2
votes
1 answer
How to pass arguments to HuggingFace TokenClassificationPipeline's tokenizer
I've finetuned a Huggingface BERT model for Named Entity Recognition. Everything is working as it should. Now I've set up a pipeline for token classification in order to predict entities out of the text I provide. Even this is working fine.
I know that…
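One common pattern (sketched here with stand-in classes, not the real transformers API) is to configure the tokenizer yourself and hand the ready object to the pipeline, instead of trying to route tokenizer keyword arguments through the pipeline call; with the real library that would look like `pipeline(..., tokenizer=AutoTokenizer.from_pretrained(path, model_max_length=512))`:

```python
# Illustrative stubs showing the configure-then-inject pattern; names
# and attributes are hypothetical, not the transformers classes.
class StubTokenizer:
    def __init__(self, model_max_length=512):
        self.model_max_length = model_max_length

class StubPipeline:
    def __init__(self, tokenizer):
        # the pipeline simply reuses whatever tokenizer it was given
        self.tokenizer = tokenizer

tok = StubTokenizer(model_max_length=256)
ner = StubPipeline(tokenizer=tok)
```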

ClaudiaR
- 3,108
- 2
- 13
- 27
2
votes
0 answers
Huggingface pre-trained model
I am trying to use the code below:
from transformers import AutoTokenizer, AutoModel
t = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(t)
model = AutoModel.from_pretrained(t)
The error: I think this error is due to the old version of…

Learner91
- 103
- 6
2
votes
1 answer
How to customize the positional embedding?
I am using the Transformer model from Hugging Face for machine translation. However, my input data has relational information, as shown below:
I want to craft a graph like the following:
[ASCII diagram in the original question: an arc drawn above the sentence, linking two of its words]
He ended his meeting…

Exploring
- 2,493
- 11
- 56
- 97
2
votes
1 answer
Getting error while extracting key value pair using LayoutLMV2 model
I am trying to extract key-value pairs from scanned invoice documents using the LayoutLMV2 model, but I am getting an error. Installation guide. I am just trying to check how the model predicts the key-value pairs from the document, or do I need to fine…

Laxmi
- 21
- 8
2
votes
1 answer
Do weights of the [PAD] token have a function?
When looking at the weights of a transformer model, I noticed that the embedding weights for the padding token [PAD] are nonzero. I was wondering whether these weights have a function, since they are ignored in the multi-head attention layers.
Would…
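A stdlib-only sketch of why the [PAD] embedding values don't influence attention outputs: the attention mask drives the scores at padded positions to zero weight before the weighted sum, so padded keys contribute nothing regardless of their embeddings. This toy masked softmax illustrates the mechanism (the real models add a large negative number to masked scores before softmax, which is equivalent):

```python
import math

def masked_softmax(scores, mask):
    # mask[i] == 0 marks a padded position; its weight becomes exactly 0
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# third position is [PAD]: its score (5.0) is irrelevant to the result
weights = masked_softmax([2.0, 1.0, 5.0], [1, 1, 0])
```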

Bas Krahmer
- 489
- 5
- 11
2
votes
1 answer
How to get around truncating long sentences with Huggingface Tokenizers?
I am new to tokenizers. My understanding is that the truncation attribute just cuts the sentence off, but I need the whole sentence for context.
For example, my sentence is:
"Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl…

canovichh
- 35
- 5
2
votes
1 answer
HuggingFace - Why does the T5 model shorten sentences?
I wanted to train a model for spell correction. I trained two models: allegro/plt5-base on Polish sentences and google/t5-v1_1-base on English sentences. Unfortunately, for reasons I can't determine, both models shorten the…

nietoperz21
- 303
- 3
- 12
2
votes
0 answers
How do I use ByteLevelBPETokenizer with UTF-8?
I am trying to apply BPE to a piece of text that is UTF-8 encoded.
Here is the code:
import io
from tokenizers import ByteLevelBPETokenizer
from tokenizers.decoders import ByteLevel
decoder = ByteLevel()
# list of the paths of your txt files
paths…
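A stdlib-only illustration of the byte-level idea underlying ByteLevelBPETokenizer: it operates on the UTF-8 bytes of the text, so any Unicode input is representable without an unknown token; a non-ASCII character simply maps to more than one byte:

```python
# 'é' encodes to two UTF-8 bytes (0xC3, 0xA9), so byte-level BPE sees
# five symbols for the four-character string below.
text = "caf\u00e9"          # "café"
raw = text.encode("utf-8")
```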

kloop
- 4,537
- 13
- 42
- 66
2
votes
1 answer
Huggingface transformers padding vs pad_to_max_length
I'm running some code using pad_to_max_length = True and everything works fine. I only get the following warning:
FutureWarning: The pad_to_max_length argument is deprecated and
will be removed in a future version, use padding=True…
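A stdlib-only sketch of the difference in semantics, as the deprecation warning hints: `padding=True` pads each batch to its longest sequence, while `padding="max_length"` pads every sequence to a fixed length (the behaviour of the old `pad_to_max_length=True`). The helper below models that on plain id lists:

```python
# Hedged model of the two padding modes; not the real tokenizer code.
def pad_batch(batch, pad_id=0, padding=True, max_length=None):
    if padding == "max_length":
        target = max_length            # fixed length, old pad_to_max_length
    else:
        target = max(len(s) for s in batch)  # longest-in-batch, padding=True
    return [s + [pad_id] * (target - len(s)) for s in batch]
```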

Peyman
- 3,097
- 5
- 33
- 56
2
votes
5 answers
Unable to install tokenizers in Mac M1
I installed transformers on a MacBook Pro M1 Max.
Following this, I installed tokenizers with
pip install tokenizers
It showed
Collecting tokenizers
Using cached tokenizers-0.12.1-cp39-cp39-macosx_12_0_arm64.whl
Successfully installed…

trialcritic
- 1,225
- 1
- 10
- 14
2
votes
1 answer
Calculate precision, recall, f1 score for custom dataset for multiclass classification Huggingface library
I am trying to do multiclass classification for a sentence-pair task. I uploaded my custom train and test datasets separately to the Hugging Face dataset hub, trained and tested my model, and was trying to see the F1 score and accuracy.
I…
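A stdlib-only sketch of the metric computation such a `compute_metrics` function needs, written in plain Python rather than sklearn: per-class precision/recall/F1, macro-averaged over the label set (with the real Trainer you would first argmax the logits to get `preds`):

```python
# Hedged sketch of macro-averaged precision/recall/F1 for multiclass
# classification; names are illustrative.
def macro_prf(preds, labels, num_classes):
    ps, rs, fs = [], [], []
    for c in range(num_classes):
        tp = sum(p == c and l == c for p, l in zip(preds, labels))
        fp = sum(p == c and l != c for p, l in zip(preds, labels))
        fn = sum(p != c and l == c for p, l in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = num_classes
    return {"precision": sum(ps) / n, "recall": sum(rs) / n, "f1": sum(fs) / n}
```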

Alex Kujur
- 121
- 6
2
votes
1 answer
Huggingface Load_dataset() function throws "ValueError: Couldn't cast"
My goal is to train a classifier able to do sentiment analysis in Slovak using the loaded SlovakBert model and the HuggingFace library. The code is executed on Google Colaboratory.
My test dataset is read from this csv…

Sotel
- 23
- 1
- 5