Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
2
votes
1 answer
Parsing the Hugging Face Transformer Output
I am looking to use the bert-english-uncased-finetuned-pos transformer, mentioned here:
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California.
I am querying the transformer this…

red-devil
- 1,064
- 1
- 20
- 34
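
A minimal sketch of querying this model with the token-classification pipeline (the model id comes from the question; how the asker actually parses the output is truncated above):

from transformers import pipeline

# The token-classification pipeline returns one dict per (sub)word token.
pos = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

for item in pos("My name is Clara and I live in Berkeley, California."):
    # "entity" holds the predicted POS tag; "start"/"end" are character offsets.
    print(item["word"], item["entity"], item["start"], item["end"])
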
2
votes
0 answers
Finetune TFBertForMaskedLM model.fit() ValueError
The Problem
I have been trying to train a TFBertForMaskedLM model with TensorFlow, but when I use model.fit() I always encounter an error. I hope someone can help and propose a solution.
Reference Paper and sample output
The Paper title is…

hueiyuan su
- 21
- 1
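
Since the asker's training code is truncated, here is a hedged, minimal sketch of the shape model.fit() expects for TFBertForMaskedLM: pass the labels inside the input dict and let Keras fall back to the model's built-in loss (the texts are placeholders, and using unmasked input ids as labels is only a smoke test):

import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

texts = ["hello world", "masked language modeling with bert"]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")

# Real MLM training would mask ~15% of the inputs first. With "labels" in the
# features and no loss passed to compile(), Keras uses the model's internal
# masked-language-modeling loss.
features = dict(enc)
features["labels"] = enc["input_ids"].copy()

model.compile(optimizer=tf.keras.optimizers.Adam(5e-5))
model.fit(tf.data.Dataset.from_tensor_slices(features).batch(2), epochs=1)
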
2
votes
0 answers
Iterating through Huggingface tokenizer with remainder
Transformer models have maximum token limits. If I want to split my text into substrings that fit within that limit, what is the generally accepted way?
Due to the treatment of special characters, it isn't the case that the tokenizer maps its tokens to something…

Mittenchops
- 18,633
- 33
- 128
- 246
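
One widely used approach (a sketch, not the only accepted way): let a fast tokenizer do the chunking itself via return_overflowing_tokens, so special tokens and an overlap stride are handled consistently:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "some very long document " * 500

# return_overflowing_tokens splits the encoding into max_length-sized chunks,
# with `stride` tokens of overlap, instead of silently dropping the remainder.
enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=32,
    return_overflowing_tokens=True,
)
for ids in enc["input_ids"]:
    print(len(ids))  # each chunk is at most 512 tokens, special tokens included
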
2
votes
3 answers
Any reason to save a pretrained BERT tokenizer?
Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer.encode()
I have seen in most places that people…

ginobimura
- 115
- 1
- 5
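
One concrete reason, sketched below: save_pretrained pins the exact vocabulary and options (such as do_lower_case) next to the fine-tuned model, so the whole artifact can be reloaded offline and shared as one unit:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Saving writes the vocab and tokenizer config alongside your model files.
tokenizer.save_pretrained("./my-finetuned-model")

# Reloading from disk reproduces the original behavior exactly.
reloaded = BertTokenizer.from_pretrained("./my-finetuned-model")
assert reloaded.encode("Hello!") == tokenizer.encode("Hello!")
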
2
votes
1 answer
GPT2 on Hugging Face (PyTorch transformers) RuntimeError: grad can be implicitly created only for scalar outputs
I am trying to fine-tune gpt2 with a custom dataset of mine. I created a basic example following the documentation from Hugging Face transformers. I receive the mentioned error. I know what it means: (basically it is calling backward on a non-scalar…

Berkay Berabi
- 1,933
- 1
- 10
- 26
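
For context, a minimal sketch of the usual fix: pass labels so the model returns a scalar loss, then call backward() on that loss rather than on the logits (the text here is a placeholder for the asker's custom dataset):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer("my custom training text", return_tensors="pt")

# Passing labels makes the model compute a scalar cross-entropy loss;
# calling backward() on the raw logits tensor raises the reported error.
out = model(**enc, labels=enc["input_ids"])
out.loss.backward()
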
2
votes
2 answers
Is there a way to get the location of the substring from which a certain token has been produced in BERT?
I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
string = tokenizer.decode(...)
However, the…

MarciBE
- 23
- 3
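
Yes, for fast tokenizers: return_offsets_mapping yields the (start, end) character span each token came from, as this sketch shows (model and sentence are placeholders):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

sentence = "Tokenization is lossy."
enc = tokenizer(sentence, return_offsets_mapping=True)

# Each (start, end) pair indexes into the original string; special tokens
# like [CLS]/[SEP] get the placeholder span (0, 0).
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(token, repr(sentence[start:end]))
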
1
vote
0 answers
Error when running a Hugging Face model in 4-bit mode in Streamlit using bitsandbytes. Quant state is being set to None unexpectedly
I am loading a Hugging Face starchat-beta model in Streamlit and caching it thus:
@st.cache_resource
def load_model():
    """Initialize the tokenizer and the AI model."""
    tokenizer =…

Abhilash Pal
- 11
- 2
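
The asker's snippet is truncated; a hedged sketch of the common pattern (model id and config values are assumptions, not taken from the question) keeps the quantization config inside the cached function:

import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "HuggingFaceH4/starchat-beta"

@st.cache_resource
def load_model():
    """Initialize the tokenizer and the 4-bit quantized model once per process."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return tokenizer, model
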
1
vote
0 answers
LayoutLMv3: postprocess method not returning data beyond 512 tokens despite complete inference
I am facing an issue with my post-processing method. I have a pipeline that involves preprocessing, inference, and post-processing steps. During the preprocessing step, I tokenize the input data and handle token overflow for sequences greater than…

j3ws3r
- 11
- 1
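
Without the asker's pipeline code one can only sketch the usual mechanics: with a fast LayoutLMv3 tokenizer, return_overflowing_tokens plus a stride yields one 512-token window per row, and overflow_to_sample_mapping tells post-processing which windows belong to the same document and must be stitched back together (words and boxes below are dummies):

from transformers import LayoutLMv3TokenizerFast

tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")

words = ["hello"] * 1000
boxes = [[0, 0, 10, 10]] * 1000

# Each overflowing window comes back as an extra row of input_ids.
enc = tokenizer(
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    stride=50,
    return_overflowing_tokens=True,
)
print(len(enc["input_ids"]), enc["overflow_to_sample_mapping"])
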
1
vote
0 answers
Failing to import transformers.models.t5.modeling_flax_t5
The following error occurs:
RuntimeError: Failed to import transformers.models.t5.modeling_flax_t5 because of the following error (look up to see its traceback): module 'jax.numpy' has no attribute 'DeviceArray'
when I try to run this command on…

fried_carriots
- 11
- 2
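
For reference, jax removed jax.numpy.DeviceArray in its 0.4 line, which is what this import error points at; a hedged check and workaround (the version pins are assumptions, so verify them against your transformers release):

# jax.numpy.DeviceArray was removed in jax 0.4.x; older transformers builds
# still reference it. Either upgrade transformers (newer releases use
# jax.Array) or pin jax/jaxlib below 0.4, e.g.:
#   pip install "jax<0.4" "jaxlib<0.4"
import jax
print(jax.__version__)
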
1
vote
0 answers
Why can T5 only generate sentences of length 20? Can someone help me? I wish I could generate longer sentences
from datasets import load_dataset
from transformers import AutoTokenizer

books = load_dataset("higashi1/mymulti30k", "en-de")
#checkpoint = "./logs/"
checkpoint = "t5-base"
tokenizer =…

HIKARI
- 11
- 1
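
The snippet is cut off, but the symptom matches generate()'s default cap of max_length=20; a sketch of lifting it with max_new_tokens (the prompt is a placeholder):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")

# generate() stops at a default max_length of 20 unless told otherwise;
# max_new_tokens raises that cap for the generated continuation.
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
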
1
vote
1 answer
Huggingface: How do I find the max length of a model?
Given a transformer model on huggingface, how do I find the maximum input sequence length?
For example, here I want to truncate to the max_length of the model: tokenizer(examples["text"], padding="max_length", truncation=True) How do I find the…

JobHunter69
- 1,706
- 5
- 25
- 49
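
A sketch of the two usual places to look, using bert-base-uncased as a stand-in: the tokenizer's model_max_length and the model config's max_position_embeddings:

from transformers import AutoConfig, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)

# The tokenizer records the length it truncates/pads to. Beware: some
# tokenizers report a huge sentinel integer when no limit was stored.
print(tokenizer.model_max_length)          # 512 for BERT

# The model config exposes the positional-embedding budget.
config = AutoConfig.from_pretrained(name)
print(config.max_position_embeddings)      # 512 for BERT
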
1
vote
1 answer
Why does the LayoutLM installation fail?
I want to install LayoutLM in Google Colaboratory.
First, I cloned LayoutLM from this GitHub repository:
https://github.com/microsoft/unilm.git
After that, I install LayoutLM by running its setup.py with this code…

Scezui
- 21
- 4
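
Roughly, the install has to run from the layoutlm subdirectory of the monorepo rather than the repo root; a sketch for a Colab cell (the exact subdirectory has moved over time, so check the current repo layout first):

# In a Colab cell; layoutlm ships as a subpackage of the unilm monorepo,
# so its setup.py lives in a subdirectory, not at the repo root.
!git clone https://github.com/microsoft/unilm.git
%cd unilm/layoutlm
!pip install .
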
1
vote
1 answer
Running LLM on a local server
I am new to LLMs. I need to run an LLM on a local server and need to download different models to experiment with. I am trying to follow this guide from Hugging Face: https://huggingface.co/docs/transformers/installation#offline-mode
To begin with, I…

zoomraider
- 117
- 1
- 9
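
A sketch of the offline-mode flow from that guide: fetch the repo once with snapshot_download, then load from the local path (gpt2 is a placeholder model id):

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download once (on a machine with internet) into the local cache...
local_dir = snapshot_download("gpt2")

# ...then load purely from disk; setting HF_HUB_OFFLINE=1 in the server's
# environment guarantees no network calls are attempted.
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModelForCausalLM.from_pretrained(local_dir)
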
1
vote
1 answer
How to create huggingface tokenizer from a "char_to_idx" dict?
Given a dictionary char_to_idx how can one create a tokenizer such that the ids of the tokens are guaranteed to be the same as in char_to_idx?
char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
tokenizer =…

Yorai Levi
- 473
- 5
- 17
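
One way that guarantees the ids (a sketch): back the tokenizer with a WordLevel model built from the dict, and pre-tokenize every character into its own "word"; the [UNK] entry is an addition of this sketch, not part of the question's dict:

from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

char_to_idx = {"a": 0, "b": 1, "c": 2, "d": 3}

# WordLevel looks every "word" up in the supplied vocab verbatim, so ids are
# exactly those of char_to_idx; Regex(".") with behavior="isolated" makes
# each character its own "word".
vocab = {**char_to_idx, "[UNK]": len(char_to_idx)}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Split(Regex("."), behavior="isolated")

print(tokenizer.encode("abcd").ids)  # [0, 1, 2, 3]
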
1
vote
0 answers
Decoded text of huggingface Unigram tokenizer has extra spaces
decoded should be equal to text but:
import tokenizers
text = "Hello World!"
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
tokenizer.train_from_iterator(text)
encoded = tokenizer.encode(text)
decoded =…

Yorai Levi
- 473
- 5
- 17
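
Two things in the snippet likely cause this (a sketch of a fix, assuming SentencePiece-style spacing is acceptable): train_from_iterator expects an iterable of texts, so a bare string is consumed character by character, and decode() joins tokens with plain spaces unless a decoder is set:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

text = "Hello World!"
tokenizer = Tokenizer(models.Unigram())

# Metaspace marks word boundaries with "▁" during pre-tokenization, and its
# decoder reverses that, so decode() reconstructs the original spacing.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

# Wrap the text in a list: iterating over a string yields characters.
tokenizer.train_from_iterator([text], trainer=trainers.UnigramTrainer())

encoded = tokenizer.encode(text)
print(tokenizer.decode(encoded.ids))  # "Hello World!"
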