Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
1 vote, 2 answers
AttributeError: 'BloomForCausalLM' object has no attribute 'encode'
I'm trying to do some basic text inference using the BLOOM model:
from transformers import AutoModelForCausalLM, AutoModel
# checkpoint = "bigscience/bloomz-7b1-mt"
checkpoint = "bigscience/bloom-1b7"
tokenizer =…

Tobi Akinyemi
- 804
- 1
- 8
- 24
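A likely fix, sketched minimally: encode belongs to the tokenizer, not the model, so load both halves of the checkpoint and tokenize before generating (the prompt string is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # encode() lives here
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Hello, I am", return_tensors="pt")  # tokenize the prompt
outputs = model.generate(**inputs, max_new_tokens=20)   # then generate
print(tokenizer.decode(outputs[0], skip_special_tokens=True))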
1 vote, 1 answer
Do huggingface translation models support separate vocabulary for source and target?
Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right…

Darren Cook
- 27,837
- 13
- 117
- 217
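One way to check empirically, as a minimal sketch (the Helsinki-NLP checkpoint and a recent transformers release with text_target support are assumptions): Marian translation checkpoints ship separate source and target SentencePiece models but, in most released checkpoints, a single joint vocabulary.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
enc_src = tok("the cat sat")                       # tokenized with the source SPM
enc_tgt = tok(text_target="le chat s'est assis")   # tokenized with the target SPM
print(enc_src["input_ids"], enc_tgt["input_ids"])
print(tok.vocab_size)  # one shared id space in this checkpoint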
1 vote, 0 answers
Training a Hugging Face model without n_epochs
I would like to train a RobertaForMaskedLM from scratch in Hugging Face.
However, I would like to not specify any stopping time, but to stop only when there is no more improvement in training. Is there a way to do that? I know that the n_epochs…

Chiara
- 372
- 5
- 17
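One common approach, sketched minimally with the Trainer API (model and dataset names are placeholders): set a very large num_train_epochs and let EarlyStoppingCallback end the run once the metric stops improving.

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1000,            # effectively unbounded
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,      # required by early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,                      # your RobertaForMaskedLM
    args=args,
    train_dataset=train_ds,           # placeholder datasets
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()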
1 vote, 1 answer
How do I know which parameters to use with a pretrained Tokenizer?
I must be missing something ...
I want to use a pretrained model with HuggingFace:
transformer_name = "Geotrend/distilbert-base-fr-cased" # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name,…

Alexandre GAREL
- 53
- 6
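The usual rule of thumb, as a minimal sketch: load the tokenizer from the same checkpoint as the model so the defaults saved with it apply, and set only per-call options such as padding and truncation yourself (the example sentences are illustrative).

from transformers import AutoModelForSequenceClassification, AutoTokenizer

transformer_name = "Geotrend/distilbert-base-fr-cased"
tokenizer = AutoTokenizer.from_pretrained(transformer_name)  # picks up saved defaults
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=2)

batch = tokenizer(["une phrase", "une autre"], padding=True, truncation=True,
                  return_tensors="pt")
outputs = model(**batch)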
1 vote, 1 answer
ValueError: bytes must be in range(0, 256) while decoding input tensor using transformer AutoTokenizer (MT5ForConditionalGeneration Model)
Relevant Code :
from transformers import (
    AdamW,
    MT5ForConditionalGeneration,
    AutoTokenizer,
    get_linear_schedule_with_warmup
)
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small',…

iamabhaykmr
- 1,803
- 3
- 24
- 49
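A frequent cause, sketched minimally under the assumption that labels were padded with -100 for the loss: ByT5 maps ids back to raw bytes, so ids outside the byte range (such as -100) raise this ValueError; replace them with the pad id before decoding.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

labels = torch.tensor([107, 104, 111, 111, 114, 1, -100, -100])  # "hello" + eos + padding
labels = torch.where(labels == -100, torch.tensor(tokenizer.pad_token_id), labels)
print(tokenizer.decode(labels, skip_special_tokens=True))  # -> "hello"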
1 vote, 0 answers
Python - Docker socket hang up after first successful API call, Docker exits midway through second call
Trying a Python program using Hugging Face transformers & FAISS. I was able to use the API successfully while testing locally, but when testing the same inside Docker, the API executes successfully the first time & then I get an Error: Socket hang…

Megha John
- 153
- 1
- 12
1 vote, 1 answer
Huggingface tokenizer not able to load model after upgrading Python to 3.10
I just updated Python to version 3.10.8. Note that I use JupyterLab.
I had to re-install a lot of packages, but now I get an error when I try to load the tokenizer of a HuggingFace model.
This is my code:
# Import libraries
from transformers import…

SilentCloud
- 1,677
- 3
- 9
- 28
1 vote, 0 answers
Does checkpointing with torch.save fail with hugging face -- if not what is the right way to checkpoint and load a hugging face (HF) model?
Does torch.save work on Hugging Face models (I am using ViT)? I assumed yes.
My error:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile,…

Charlie Parker
- 5,884
- 57
- 198
- 323
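The idiomatic checkpointing for transformers models is save_pretrained/from_pretrained rather than torch.save on the wrapper object; a minimal sketch (the ViT checkpoint name is just an example):

from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.save_pretrained("my_vit_checkpoint")  # writes config.json plus the weights
model = ViTForImageClassification.from_pretrained("my_vit_checkpoint")  # restore later

# torch.save also works if applied to the state_dict, not the whole object:
# torch.save(model.state_dict(), "vit.pt"); model.load_state_dict(torch.load("vit.pt"))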
1 vote, 1 answer
Building wheel for tokenizers (pyproject.toml) did not run successfully - Python 3.9.9 - Windows 10
Yes, there are several other questions like this, but no solution was provided.
I am trying to install and run this project
https://github.com/xashru/punctuation-restoration
I have cloned the GitHub repository
Installed Rust from here, downloading x64:…

Furkan Gözükara
- 22,964
- 77
- 205
- 342
1 vote, 0 answers
Why am I getting a tensor of NaN values in PyTorch Huggingface inference?
I am fine-tuning a DistilBERT model for 200k iterations. Once it saves the checkpoint file, I run inference. However, my inference vector for any random text is NaN. An example output is below. Does anyone have any idea?
tensor([[[nan, nan, nan,…

Ramraj Chandradevan
- 141
- 2
- 10
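A quick diagnostic, sketched minimally (the checkpoint path is a placeholder): check whether the saved weights themselves already contain NaNs, which would point at a diverged training run rather than the inference code.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/checkpoint")  # placeholder path
bad = [n for n, p in model.named_parameters() if torch.isnan(p).any()]
print("parameters containing NaN:", bad or "none")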
1 vote, 1 answer
Getting an error installing a package in the terminal to use Hugging Face in VS Code
I am following the steps from the Hugging Face website (https://huggingface.co/docs/transformers/installation) in order to start using Hugging Face in Visual Studio Code and install all the transformers.
I was on the last process, where I had to type…

waleeed
- 35
- 7
1 vote, 1 answer
How to get a loss from Huggingface's pipeline method in order to finetune a model?
I'm trying to use this model on huggingface for QA. The code for it is in the link:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
model_name = "deepset/roberta-base-squad2"
# a) Get predictions
nlp =…

Penguin
- 1,923
- 3
- 21
- 51
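pipeline() is inference-only and does not expose a loss; for fine-tuning you call the model directly with labels, which for QA are start/end token positions. A minimal sketch (the span indices are illustrative):

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

inputs = tokenizer("Why is model conversion important?",
                   "Conversion gives freedom to the user.",
                   return_tensors="pt")
# Supplying the gold answer span makes the forward pass return a loss
outputs = model(**inputs,
                start_positions=torch.tensor([11]),
                end_positions=torch.tensor([15]))
print(outputs.loss)  # backpropagate this to fine-tune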
1 vote, 1 answer
from transformers import BertTokenizer
I am trying to implement the following model from Hugging Face but am not entirely sure how to feed the model the texts that I need to pass to do the classification. The documentation (https://huggingface.co/DaNLP/da-bert-tone-subjective-objective)…

Bemz
- 129
- 1
- 16
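The usual pattern, sketched minimally for the DaNLP checkpoint named in the question (assuming it loads as a sequence-classification head): tokenize the texts with return_tensors and pass the batch straight to the model.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

name = "DaNLP/da-bert-tone-subjective-objective"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForSequenceClassification.from_pretrained(name)

texts = ["Jeg tror det bliver regnvejr i morgen."]  # illustrative input
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted class ids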
1 vote, 0 answers
Unexpected keyword argument 'unk_token'
When trying to load this tokenizer I am getting this error, but I don't know why it won't accept the unk_token. Any ideas?
tokenizer = tokenizers.SentencePieceUnigramTokenizer(unk_token="", eos_token="", pad_token="")
----> 1 tokenizer =…

Antoine23
- 79
- 1
- 5
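In the tokenizers library, SentencePieceUnigramTokenizer's constructor does not accept unk_token/eos_token/pad_token; special tokens are supplied at training time instead. A minimal sketch under that assumption (the corpus and token strings are illustrative):

from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()  # constructor takes no special-token kwargs
tokenizer.train_from_iterator(
    ["some training text", "more training text"],  # illustrative corpus
    vocab_size=100,
    special_tokens=["<unk>", "</s>", "<pad>"],
    unk_token="<unk>",
)
print(tokenizer.encode("some text").tokens)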
1 vote, 0 answers
How does Byte-pair Encoding handle equally frequent pairs?
Let's say we train BPE tokenizer on this string:
D C B B A B C D C B A B C D
As I understand it, BPE merges the most frequent pairs, but what will the algorithm merge first here?
DC, BC, CD, BA, or AB? All occur 2 times in this dummy corpus.
Seems like…

Nikolay Klimenko
- 11
- 1
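You can observe the tie-breaking empirically by training a tiny BPE model on the string, as a minimal sketch with the tokenizers library (spaces removed so pairs form between symbols; other BPE implementations may break ties differently):

from tokenizers import Tokenizer, models, trainers

corpus = ["DCBBABCDCBABCD"]  # the question's string with spaces removed
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=8, show_progress=False)
tokenizer.train_from_iterator(corpus, trainer)

# Tokens in id order: the base alphabet first, then merged tokens in the
# order they were learned, which exposes the tie-breaking rule.
for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
    print(idx, token)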