Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
1
vote
2 answers

AttributeError: 'BloomForCausalLM' object has no attribute 'encode'

I'm trying to do some basic text inference using the BLOOM model: from transformers import AutoModelForCausalLM, AutoModel # checkpoint = "bigscience/bloomz-7b1-mt" checkpoint = "bigscience/bloom-1b7" tokenizer =…
1
vote
1 answer

Do huggingface translation models support separate vocabulary for source and target?

Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right…
1
vote
0 answers

Training a Hugging Face model without n_epochs

I would like to train a RobertaForMaskedLM from scratch in Hugging Face. However, I would like not to specify any stopping time, but to stop only when training shows no further improvement. Is there a way to do that? I know that the n_epochs…
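The stopping rule described in this question (stop once evaluation stops improving) is commonly called patience-based early stopping. The following is a framework-free sketch of that idea, not the Trainer's implementation; the function name, `patience`, and `min_delta` are illustrative assumptions, and the loss values in the usage note are made up for the demonstration.

```python
def train_until_no_improvement(eval_losses, patience=3, min_delta=0.0):
    """Return (evaluations_run, best_loss) under a patience-based stopping rule.

    `eval_losses` stands in for the loss measured after each evaluation round;
    in a real training loop it would be computed on the fly.
    """
    best = float("inf")
    stale = 0  # consecutive evaluations without improvement
    for step, loss in enumerate(eval_losses, start=1):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return step, best  # stop early
    return len(eval_losses), best


# Hypothetical loss curve used only to exercise the rule.
stopped_at, best_loss = train_until_no_improvement(
    [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.69], patience=3
)
```

With these made-up values the loop stops at the sixth evaluation, because three evaluations in a row fail to beat 0.7. In transformers, `EarlyStoppingCallback` applies a similar rule inside `Trainer`, but an upper bound on epochs or steps still has to be set.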
1
vote
1 answer

How do I know which parameters to use with a pretrained Tokenizer?

I must be missing something ... I want to use a pretrained model with HuggingFace: transformer_name = "Geotrend/distilbert-base-fr-cased" # Or whatever model model = AutoModelForSequenceClassification.from_pretrained(transformer_name,…
1
vote
1 answer

ValueError: bytes must be in range(0, 256) while decoding input tensor using transformer AutoTokenizer (MT5ForConditionalGeneration Model)

Relevant code: from transformers import ( AdamW, MT5ForConditionalGeneration, AutoTokenizer, get_linear_schedule_with_warmup ) tokenizer = AutoTokenizer.from_pretrained('google/byt5-small',…
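The message in the title is the one Python's built-in `bytes` constructor raises when it receives an integer outside 0..255. The sketch below assumes a byte-level decoder that turns each token id into a byte by subtracting a fixed offset; the `ids_to_text` helper and its `offset=3` default are hypothetical stand-ins, not the tokenizer's actual code. It shows how an id larger than a byte-level vocabulary could produce this error.

```python
def ids_to_text(ids, offset=3):
    # Hypothetical byte-level decoding: each id is assumed to be a byte value
    # shifted by `offset` (to leave room for special tokens).
    return bytes(i - offset for i in ids).decode("utf-8", errors="ignore")


ok = ids_to_text([107, 108])  # ids that map back to valid bytes

try:
    ids_to_text([300])  # 300 - 3 = 297, which does not fit in one byte
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

If something like this is happening here, a reasonable first check may be whether the ids being decoded come from a model whose vocabulary matches the tokenizer.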
1
vote
0 answers

Python - Docker socket hang up after first successful API call, docker exits midway through second call

I'm trying a Python program using Hugging Face transformers and faiss. I was able to use the API successfully while testing locally. But when testing the same inside Docker, the API executes successfully the first time and then I get an error: Socket hang…
1
vote
1 answer

Huggingface tokenizer not able to load model after upgrading python to 3.10

I just updated Python to version 3.10.8. Note that I use JupyterLab. I had to re-install a lot of packages, but now I get an error when I try to load the tokenizer of a HuggingFace model. This is my code: # Import libraries from transformers import…
1
vote
0 answers

Does checkpointing with torch.save fail with hugging face -- if not what is the right way to checkpoint and load a hugging face (HF) model?

Does torch.save work on hugging face models (I am using vit)? I assumed yes. My error: File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 379, in save _save(obj, opened_zipfile,…
1
vote
1 answer

Building wheel for tokenizers (pyproject.toml) did not run successfully - Python 3.9.9 - Windows 10

Yes, there are several other questions like this, but no solution was provided. I am trying to install and run this project: https://github.com/xashru/punctuation-restoration. I have cloned the GitHub repository and installed Rust from here, downloading x64 :…
Furkan Gözükara
1
vote
0 answers

Why am I getting a tensor of NaN values in PyTorch Hugging Face inference?

I am fine-tuning a DistilBERT model for 200k iterations. Once it saves the checkpoint file, I run inference. However, my inference vector for any random text is NaN. An example output is below. Does anyone have any idea? tensor([[[nan, nan, nan,…
1
vote
1 answer

Getting an error installing a package in the terminal to use Hugging Face in VS Code

I am following the steps from the Hugging Face website (https://huggingface.co/docs/transformers/installation) in order to start using Hugging Face in Visual Studio Code and install transformers. I was on the last step, where I had to type…
1
vote
1 answer

How to get a loss from Huggingface's pipeline method in order to finetune a model?

I'm trying to use this model on huggingface for QA. The code for it is in the link: from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline model_name = "deepset/roberta-base-squad2" # a) Get predictions nlp =…
1
vote
1 answer

from transformers import BertTokenizer

I am trying to implement the following model from hugging face but not entirely sure how to feed the model the texts that I need to pass to do the classification. The documentation (https://huggingface.co/DaNLP/da-bert-tone-subjective-objective)…
1
vote
0 answers

Unexpected keyword argument 'unk_token'

When trying to load this tokenizer I get this error, but I don't know why it can't take the unk_token, strangely. Any ideas? tokenizer = tokenizers.SentencePieceUnigramTokenizer(unk_token="", eos_token="", pad_token="") ----> 1 tokenizer =…
1
vote
0 answers

How does Byte-pair Encoding handle equally frequent pairs?

Let's say we train a BPE tokenizer on this string: D C B B A B C D C B A B C D. As I understand it, BPE merges the most frequent pairs, but what will the algorithm merge here first? DC, BC, CD, BA, or AB? All occur 2 times in this dummy corpus. Seems like…
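To make the tie concrete, here is a minimal sketch that counts adjacent pairs in the dummy corpus, assuming the whole string is treated as one sequence of symbols (a real tokenizer would usually split on whitespace first). The two tie-breaking policies shown, first-seen and lexicographic, are illustrative assumptions and are not claimed to match what any particular library does.

```python
from collections import Counter

# Treat the corpus as a single sequence of symbols (an assumption for this sketch).
symbols = "D C B B A B C D C B A B C D".split()

# Count every adjacent pair of symbols.
pair_counts = Counter(zip(symbols, symbols[1:]))

# Collect all pairs that share the highest count.
max_count = max(pair_counts.values())
tied = sorted(pair for pair, count in pair_counts.items() if count == max_count)

# Two hypothetical tie-breaking policies:
first_seen = max(pair_counts, key=pair_counts.get)  # earliest pair in the text wins
lexicographic = min(tied)                            # alphabetically smallest pair wins
```

Under this counting, CB is also tied at 2 alongside the five pairs listed in the question, and the two policies pick different winners (DC versus AB). Which pair gets merged first therefore depends on how a given implementation orders ties.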