Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
2
votes
0 answers

Your fast tokenizer does not have the necessary information to save the vocabulary for a slow tokenizer

I'm trying to fine-tune a T5 model for paraphrasing Farsi sentences. I'm using this model as my base. My dataset is a paired-sentence dataset in which each row is a pair of paraphrased sentences. I want to fine-tune the model on this dataset. The…
2
votes
1 answer

KeyError: 'eval_loss' in Huggingface Trainer

I am trying to build a Question Answering pipeline with the Huggingface framework but am facing the KeyError: 'eval_loss' error. My goal is to train, save the best model at the end, and evaluate the validation set on the loaded model. My trainer…
2
votes
1 answer

How to know if HuggingFace's pipeline text input exceeds 512 tokens

I've finetuned a Huggingface BERT model for Named Entity Recognition based on 'bert-base-uncased'. I perform inference like this: from transformers import pipeline ner_pipeline = pipeline('token-classification', model=model_folder,…
ClaudiaR
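For the token-limit question above, a minimal sketch of the usual check: count the tokens the tokenizer would produce and compare against the model's limit before calling the pipeline. `toy_tokenize` below is a whitespace stand-in for a real subword tokenizer (an assumption, to keep the sketch self-contained); with transformers one would use `len(tokenizer(text)["input_ids"])` instead.

```python
# Sketch: detect inputs that would exceed a 512-token model limit
# before handing them to a pipeline.

MAX_TOKENS = 512

def toy_tokenize(text):
    # Stand-in for a real tokenizer's encode step. A BERT tokenizer
    # produces subword tokens, so the real count is usually higher
    # than a whitespace split.
    return text.split()

def exceeds_limit(text, max_tokens=MAX_TOKENS):
    # True when the tokenized input is longer than the model accepts.
    return len(toy_tokenize(text)) > max_tokens

print(exceeds_limit("word " * 600))  # True
```

The same check with a real tokenizer lets you decide whether to truncate, chunk, or reject the input up front rather than relying on the pipeline's silent truncation.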
2
votes
1 answer

How to pass arguments to HuggingFace TokenClassificationPipeline's tokenizer

I've finetuned a Huggingface BERT model for Named Entity Recognition. Everything is working as it should. Now I've set up a pipeline for token classification in order to predict entities out of the text I provide. Even this is working fine. I know that…
2
votes
0 answers

Huggingface pre-trained model

I'm trying to use the code below: from transformers import AutoTokenizer, AutoModel t = "ProsusAI/finbert" tokenizer = AutoTokenizer.from_pretrained(t) model = AutoModel.from_pretrained(t) The error: I think this error is due to an old version of…
2
votes
1 answer

How to customize the positional embedding?

I am using the Transformer model from Hugging Face for machine translation. However, my input data has relational information, and I want to craft a graph like the following (the ASCII diagram is flattened in this excerpt): He ended his meeting…
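For context on the positional-embedding question above, here is the standard sinusoidal scheme (Vaswani et al.) in plain Python; a custom or relation-aware scheme would replace this function. This is a sketch of the baseline being customized, not any library's internal code.

```python
import math

def sinusoidal_pe(seq_len, d_model):
    """Build the standard sinusoidal positional-embedding table:
    even dimensions use sin, odd dimensions use cos, with
    frequencies decreasing geometrically across dimension pairs."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

Customizing positions for graph-structured input typically means replacing `pos` with some relation-derived index (or adding a learned relative-position term), while keeping the table shape `(seq_len, d_model)` the same.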
2
votes
1 answer

Getting error while extracting key value pair using LayoutLMV2 model

I am trying to extract key-value pairs from scanned invoice documents using the LayoutLMV2 model, but I am getting an error. Installation guide. I am just trying to check how the model predicts the key-value pairs from the document, or whether I need to fine…
2
votes
1 answer

Do weights of the [PAD] token have a function?

When looking at the weights of a transformer model, I noticed that the embedding weights for the padding token [PAD] are nonzero. I was wondering whether these weights have a function, since they are ignored in the multi-head attention layers. Would…
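The [PAD] question above hinges on how attention masking works: masked positions receive a score of negative infinity before the softmax, so their attention weight is exactly zero no matter what the [PAD] embedding contains. A minimal sketch of that mechanism (an illustration, not any model's actual code):

```python
import math

def masked_softmax(scores, mask):
    """Attention-style softmax where masked (pad) positions are set
    to -inf, so they contribute exactly zero weight regardless of
    the [PAD] token's embedding values."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    mx = max(masked)                      # subtract max for stability
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the mask zeroes out padded positions here, the nonzero [PAD] weights never influence attention outputs; they only matter in places that bypass the mask.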
2
votes
1 answer

How to go around truncating long sentences with Huggingface Tokenizers?

I am new to tokenizers. My understanding is that the truncate attribute just cuts the sentences, but I need the whole sentence for context. For example, my sentence is: "Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl…
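The usual answer to the truncation question above is to split long inputs into overlapping windows instead of cutting them, which is what transformers' `return_overflowing_tokens=True` with a `stride` does. A stdlib-only sketch of that windowing logic (parameter names chosen to mirror the library; the function itself is illustrative):

```python
def chunk_with_stride(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows so no
    context is lost to truncation. Consecutive windows overlap by
    `stride` tokens, mirroring the overflowing-tokens behaviour."""
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks
```

Each window is run through the model separately and the predictions are merged, so the sentence keeps its full context across window boundaries.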
2
votes
1 answer

HuggingFace - Why does the T5 model shorten sentences?

I wanted to train the model for spell correction. I trained two models: allegro/plt5-base with Polish sentences and google/t5-v1_1-base with English sentences. Unfortunately, I don't know why, but both models shorten the…
2
votes
0 answers

How do I use ByteLevelBPETokenizer with UTF-8?

I am trying to apply BPE to a piece of text that is UTF-8 encoded. Here is the code: import io from tokenizers import ByteLevelBPETokenizer from tokenizers.decoders import ByteLevel # list of the paths of your txt files decoder = ByteLevel() paths…
kloop
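On the UTF-8 question above: byte-level BPE operates on raw bytes, so any UTF-8 string is representable without unknown tokens; multi-byte characters simply become several byte-level symbols. A stdlib-only sketch of that base encoding (an illustration of the principle, not the tokenizers library's implementation):

```python
def to_byte_tokens(text):
    """Byte-level tokenization stand-in: map any UTF-8 string to its
    byte values (0-255). Every character is representable, so there
    is never an 'unknown token'; BPE merges are then learned on top
    of these byte symbols."""
    return list(text.encode("utf-8"))
```

Decoding reverses the mapping (`bytes(ids).decode("utf-8")`), which is essentially what the `ByteLevel` decoder does after undoing the merges.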
2
votes
1 answer

Huggingface transformers padding vs pad_to_max_length

I'm running code using pad_to_max_length = True and everything works fine. I only get a warning, as follows: FutureWarning: The pad_to_max_length argument is deprecated and will be removed in a future version, use padding=True…
Peyman
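For the deprecation warning above, the mapping is: `pad_to_max_length=True` behaves like `padding="max_length"`, while `padding=True` pads to the longest sequence in the batch. A sketch of that resolution logic (our own illustration, not the library's actual shim):

```python
def resolve_padding(padding=False, pad_to_max_length=None):
    """Translate the legacy pad_to_max_length flag into the newer
    padding strategies: 'max_length', 'longest', or 'do_not_pad'."""
    if pad_to_max_length:
        # Legacy flag: always pad out to the model's max_length.
        return "max_length"
    if padding is True:
        # padding=True pads to the longest sequence in the batch.
        return "longest"
    # padding may also be an explicit strategy string, or falsy.
    return padding or "do_not_pad"
```

So the two arguments are not equivalent: switching `pad_to_max_length=True` to `padding=True` changes the behaviour from fixed-length padding to batch-longest padding; `padding="max_length"` preserves the old behaviour.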
2
votes
5 answers

Unable to install tokenizers in Mac M1

I installed transformers on a MacBook Pro M1 Max. Following this, I installed the tokenizers with pip install tokenizers. It showed: Collecting tokenizers Using cached tokenizers-0.12.1-cp39-cp39-macosx_12_0_arm64.whl Successfully installed…
trialcritic
2
votes
1 answer

Calculate precision, recall, and F1 score for a custom dataset for multiclass classification with the Huggingface library

I am trying to do multiclass classification for the sentence-pair task. I uploaded my custom train and test datasets separately to the Hugging Face dataset hub, trained my model, tested it, and was trying to see the F1 score and accuracy. I…
2
votes
1 answer

Huggingface Load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier able to do sentiment analysis in the Slovak language using a loaded SlovakBert model and the HuggingFace library. The code is executed on Google Colaboratory. My test dataset is read from this CSV…