Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
2 votes · 1 answer

Parsing the Hugging Face Transformer Output

I am looking to use the bert-english-uncased-finetuned-pos transformer, mentioned here: https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California. I am querying the transformer this…
red-devil · 1,064 reputation · 1 gold · 20 silver · 34 bronze
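
A minimal sketch of one way to query this model and parse the output, using the token-classification pipeline (the model id is taken from the question's link):

```python
# A sketch: run the POS model through the token-classification pipeline
# and read the per-token predictions from the returned list of dicts.
from transformers import pipeline

pos = pipeline("token-classification",
               model="vblagoje/bert-english-uncased-finetuned-pos")

for token in pos("My name is Clara and I live in Berkeley, California."):
    # each dict carries the token text, its POS tag, and a confidence score
    print(token["word"], token["entity"], round(token["score"], 3))
```
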
2 votes · 0 answers

Finetune TFBertForMaskedLM model.fit() ValueError

The problem: I have been trying to train a TFBertForMaskedLM model with TensorFlow, but when I use model.fit() I always encounter an error. I hope someone can help and propose a solution. Reference paper and sample output: the paper title is…
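
A minimal sketch of one way model.fit() can work with TFBertForMaskedLM, assuming a recent transformers release: placing "labels" inside the feature dict lets the model compute its internal MLM loss, so compile() needs no explicit loss.

```python
# A sketch: the model computes its own masked-LM loss when "labels" is
# present in the inputs, so compile() is called without a loss argument.
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer(["The quick brown fox [MASK] over the lazy dog."],
                padding="max_length", max_length=16, return_tensors="np")
features = dict(enc)
features["labels"] = enc["input_ids"].copy()  # -100 entries would be ignored

dataset = tf.data.Dataset.from_tensor_slices(features).batch(1)
model.compile(optimizer=tf.keras.optimizers.Adam(5e-5))
model.fit(dataset, epochs=1)
```
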
2 votes · 0 answers

Iterating through Huggingface tokenizer with remainder

Transformer models have maximum token limits. If I want to substring my text to fit within that limit, what is the generally accepted way? Due to the treatment of special characters, it isn't the case that the tokenizer maps its tokens to something…
Mittenchops · 18,633 reputation · 33 gold · 128 silver · 246 bronze
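
One commonly used approach (a sketch, assuming a fast tokenizer): let return_overflowing_tokens split the text at the token level, so the remainder comes back as extra chunks.

```python
# A sketch: the fast tokenizer splits long input into max_length-sized
# chunks and returns the remainder as additional sequences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "some very long document " * 1000   # placeholder input

enc = tokenizer(long_text,
                max_length=512,
                truncation=True,
                return_overflowing_tokens=True,
                stride=0)                        # stride > 0 overlaps chunks

for ids in enc["input_ids"]:
    chunk = tokenizer.decode(ids, skip_special_tokens=True)
    # feed each chunk to the model separately
```
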
2 votes · 3 answers

Any reason to save a pretrained BERT tokenizer?

Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer.encode(). I have seen in most places that people…
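
The usual reason is reproducibility: saving keeps the tokenizer version pinned next to the fine-tuned weights. A sketch (the directory name is hypothetical):

```python
# A sketch: save_pretrained writes vocab.txt plus the tokenizer config, so
# the exact same tokenizer reloads from the fine-tuned model's directory.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
tokenizer.save_pretrained("./my-finetuned-bert")   # hypothetical path

# later, model and tokenizer load from the same place:
tokenizer = BertTokenizer.from_pretrained("./my-finetuned-bert")
```
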
2 votes · 1 answer

GPT2 on Hugging Face (PyTorch transformers): RuntimeError: grad can be implicitly created only for scalar outputs

I am trying to fine-tune gpt2 with a custom dataset of mine. I created a basic example following the documentation from Hugging Face transformers, and I receive the mentioned error. I know what it means: (basically it is calling backward on a non-scalar…
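
A minimal sketch of the usual fix: pass labels so the model returns a scalar loss, and call backward() on that rather than on the logits.

```python
# A sketch: with labels, GPT2LMHeadModel returns a scalar cross-entropy
# loss, which backward() accepts; calling backward() on the logits would
# raise the RuntimeError from the question.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer("my custom training text", return_tensors="pt")
outputs = model(**enc, labels=enc["input_ids"])
outputs.loss.backward()   # loss is a scalar, so this works
```
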
2 votes · 2 answers

Is there a way to get the location of the substring from which a certain token has been produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings: string = tokenizer.decode(...). However, the…
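
Yes, with a fast tokenizer: return_offsets_mapping gives each token's (start, end) character span in the original string. A sketch:

```python
# A sketch: offset_mapping maps every token back to the substring it came
# from; special tokens like [CLS] map to the empty span (0, 0).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
sentence = "Tokenization is fun"
enc = tokenizer(sentence, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(token, repr(sentence[start:end]))
```
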
1 vote · 0 answers

Error when running a Hugging Face model in 4-bit mode in Streamlit using bitsandbytes: quant state is unexpectedly being set to None

I am loading the Hugging Face starchat-beta model in Streamlit and caching it thus:

    @st.cache_resource
    def load_model():
        """Initialize the tokenizer and the AI model."""
        tokenizer =…
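
For reference, a sketch of loading a 4-bit starchat-beta model under st.cache_resource with an explicit BitsAndBytesConfig (the checkpoint id is an assumption based on the question):

```python
# A sketch: cache the quantized model once per Streamlit session; the
# explicit BitsAndBytesConfig keeps the 4-bit quantization state attached
# to the loaded weights.
import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

@st.cache_resource
def load_model():
    """Initialize the tokenizer and the AI model."""
    quant = BitsAndBytesConfig(load_in_4bit=True,
                               bnb_4bit_compute_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
    model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta",
                                                 quantization_config=quant,
                                                 device_map="auto")
    return tokenizer, model
```
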
1 vote · 0 answers

LayoutLMv3: issue with postprocess method not returning data beyond 512 tokens despite complete inference

I am facing an issue with my post-processing method. I have a pipeline that involves preprocessing, inference, and post-processing steps. During the preprocessing step, I tokenize the input data and handle token overflow for sequences greater than…
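
A sketch of the chunking side of such a pipeline, assuming a fast LayoutLMv3 processor; the post-processing step then has to aggregate all returned chunks instead of only the first one:

```python
# A sketch: return_overflowing_tokens yields one row per 512-token chunk,
# and overflow_to_sample_mapping says which original sample each row
# belongs to, so post-processing can merge predictions past 512 tokens.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base",
                                          apply_ocr=False)
encoding = processor(image, words, boxes=boxes,   # assumed OCR inputs
                     truncation=True,
                     max_length=512,
                     stride=128,
                     return_overflowing_tokens=True,
                     padding="max_length",
                     return_tensors="pt")
print(encoding["overflow_to_sample_mapping"])
```
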
1 vote · 0 answers

Failing to import transformers.models.t5.modeling_flax_t5

The following error occurs:

    RuntimeError: Failed to import transformers.models.t5.modeling_flax_t5
    because of the following error (look up to see its traceback):
    module 'jax.numpy' has no attribute 'DeviceArray'

when I try to run this command on…
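
This usually signals a JAX/transformers version mismatch: jax.numpy.DeviceArray was removed in JAX 0.4 (replaced by jax.Array), while older transformers releases still reference it. A quick check, with the usual fixes noted as comments:

```python
# A sketch: print both versions; if jax >= 0.4 while transformers is old,
# either upgrade transformers or pin the older JAX, e.g.
#   pip install "jax<0.4" "jaxlib<0.4"
import jax
import transformers

print("jax:", jax.__version__)
print("transformers:", transformers.__version__)
```
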
1 vote · 0 answers

Why can T5 only generate sentences of length 20? How can I generate longer sentences?

    from datasets import load_dataset
    books = load_dataset('higashi1/mymulti30k', "en-de")

    from transformers import AutoTokenizer
    #checkpoint = "./logs/"
    checkpoint = "t5-base"
    tokenizer =…
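
The likely cause is generate()'s historical default of max_length=20; passing max_new_tokens (or a larger max_length) lifts the cap. A sketch:

```python
# A sketch: max_new_tokens overrides the default 20-token generation cap.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
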
1 vote · 1 answer

Huggingface: How do I find the max length of a model?

Given a transformer model on Hugging Face, how do I find the maximum input sequence length? For example, here I want to truncate to the max_length of the model:

    tokenizer(examples["text"], padding="max_length", truncation=True)

How do I find the…
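
The two usual places to look, sketched below: the tokenizer's model_max_length and the config's max_position_embeddings.

```python
# A sketch: both values are 512 for bert-base-uncased; note that some
# tokenizers report a huge sentinel model_max_length when no limit was
# recorded, in which case the config is the better source.
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")

print(tokenizer.model_max_length)       # 512
print(config.max_position_embeddings)   # 512
```
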
1 vote · 1 answer

Why does the LayoutLM installation fail?

I want to install LayoutLM in Google Colaboratory. First, I cloned LayoutLM from this GitHub repository: https://github.com/microsoft/unilm.git. After that, I install LayoutLM by running its setup.py file with this code…
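
For reference, a sketch of that install flow as Colab notebook cells (the layoutlm subfolder is an assumption; the repo layout has changed over time):

```python
# Colab notebook cells ("!" runs shell commands); paths are assumptions.
!git clone https://github.com/microsoft/unilm.git
%cd unilm/layoutlm
!pip install .
```
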
1 vote · 1 answer

Running LLM on a local server

I am new to LLMs. I need to run an LLM on a local server and download different models to experiment with. I am trying to follow this guide from Hugging Face: https://huggingface.co/docs/transformers/installation#offline-mode To begin with, I…
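
A sketch of the offline flow that guide describes: fetch the model once, then load it with the offline flag set so nothing touches the network.

```python
# A sketch: HF_HUB_OFFLINE forces transformers to use only local files;
# the directory below is a hypothetical save_pretrained/download target.
import os
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/my-llm")
model = AutoModelForCausalLM.from_pretrained("./models/my-llm")
```
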
1 vote · 1 answer

How to create huggingface tokenizer from a "char_to_idx" dict?

Given a dictionary char_to_idx, how can one create a tokenizer such that the ids of the tokens are guaranteed to be the same as in char_to_idx?

    char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
    tokenizer =…
Yorai Levi · 473 reputation · 5 silver · 17 bronze
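
One way to do it (a sketch, not the only option): a WordLevel model built directly from char_to_idx, plus a pre-tokenizer that isolates every character, keeps the ids exactly as given.

```python
# A sketch: WordLevel uses char_to_idx verbatim as its vocab; the Split
# pre-tokenizer turns every character into its own "word". Add an unk
# token to the vocab if unseen characters can appear.
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
tokenizer = Tokenizer(WordLevel(vocab=char_to_idx, unk_token=None))
tokenizer.pre_tokenizer = Split(Regex("."), behavior="isolated")

print(tokenizer.encode("abcd").ids)   # [0, 1, 2, 3]
```
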
1 vote · 0 answers

Decoded text of huggingface Unigram tokenizer has extra spaces

decoded should be equal to text, but:

    import tokenizers

    text = "Hello World!"
    tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
    tokenizer.train_from_iterator(text)
    encoded = tokenizer.encode(text)
    decoded =…
Yorai Levi · 473 reputation · 5 silver · 17 bronze
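
A sketch of one likely fix: without a decoder, decode() joins tokens with spaces, so pairing a Metaspace pre-tokenizer with a Metaspace decoder restores the original spacing. Note also that train_from_iterator expects an iterable of strings, and a bare string is iterated character by character.

```python
# A sketch: Metaspace marks word boundaries during pre-tokenization and
# the matching decoder turns those marks back into single spaces.
import tokenizers

text = "Hello World!"
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Metaspace()
tokenizer.decoder = tokenizers.decoders.Metaspace()

trainer = tokenizers.trainers.UnigramTrainer()
tokenizer.train_from_iterator([text], trainer=trainer)  # note the list

encoded = tokenizer.encode(text)
print(tokenizer.decode(encoded.ids))  # expected: "Hello World!"
```
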