Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
5 votes · 2 answers

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("bert-base-cased") sentence = "The Natural Science…
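In transformers, `tokenizer.convert_tokens_to_string(tokens)` does this directly. For intuition, a minimal pure-Python sketch of the merge step, assuming WordPiece-style tokens where continuation pieces carry a leading `##`:

```python
def wordpiece_to_text(tokens):
    """Merge WordPiece tokens back into a rough surface string.

    Continuation pieces start with '##'; this mirrors roughly what
    BertTokenizer.convert_tokens_to_string does, minus special-token
    handling and detokenization of punctuation spacing.
    """
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue the continuation onto the previous piece
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_to_text(["The", "Natural", "Scie", "##nce", "Museum"]))
# → The Natural Science Museum
```

To recover the *original* text span rather than a normalized one, a fast tokenizer's `return_offsets_mapping=True` gives character offsets to slice the source sentence with.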
5 votes · 1 answer

On-the-fly tokenization with datasets, tokenizers, and torch Datasets and Dataloaders

I have a question regarding "on-the-fly" tokenization. It was prompted by reading "How to train a new language model from scratch using Transformers and Tokenizers" here. Towards the end there is this sentence: "If your dataset is…
Pietro · 415
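The idea of on-the-fly tokenization is to keep raw strings in the dataset and tokenize per sample inside `__getitem__`, so DataLoader workers do the work during training. A dependency-free sketch (with PyTorch you would subclass `torch.utils.data.Dataset`; the `tokenize_fn` here is a stand-in for a HuggingFace fast tokenizer):

```python
class LazyTextDataset:
    """Tokenize on access instead of up front.

    Stores raw strings and applies tokenize_fn lazily; with PyTorch this
    class would subclass torch.utils.data.Dataset and tokenize_fn would
    wrap a HuggingFace tokenizer call. Both names are illustrative.
    """
    def __init__(self, texts, tokenize_fn):
        self.texts = texts            # raw strings, kept untokenized
        self.tokenize_fn = tokenize_fn

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, "on the fly".
        return self.tokenize_fn(self.texts[idx])

ds = LazyTextDataset(["hello world", "foo"], lambda s: s.split())
print(ds[0])  # → ['hello', 'world']
```

The alternative (and what `datasets.Dataset.map` with a tokenize function gives you) is eager pre-tokenization, which trades startup time and disk cache for faster epochs.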
5 votes · 3 answers

Huggingface BERT Tokenizer add new token

I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords. tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased') tokenizer.encode_plus("Somespecialcompany") output: {'input_ids':…
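The usual fix in transformers is `tokenizer.add_tokens(["Somespecialcompany"])` followed by `model.resize_token_embeddings(len(tokenizer))`. To see *why* adding the token helps, here is a toy longest-match-first tokenizer (pure Python; the vocab and matching logic are a simplified illustration of WordPiece, not BERT's actual implementation):

```python
def greedy_wordpiece(word, vocab):
    """Toy longest-match-first WordPiece split, showing why an
    out-of-vocabulary company name fragments into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

vocab = {"some", "##special", "##company"}
print(greedy_wordpiece("somespecialcompany", vocab))
# → ['some', '##special', '##company']

vocab.add("somespecialcompany")  # roughly what tokenizer.add_tokens() achieves
print(greedy_wordpiece("somespecialcompany", vocab))
# → ['somespecialcompany']
```

After `add_tokens`, resizing the embeddings is mandatory, since the new id would otherwise index past the embedding matrix.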
5 votes · 2 answers

How do I translate using HuggingFace from Chinese to English?

I want to translate from Chinese to English using HuggingFace's transformers using a pretrained "xlm-mlm-xnli15-1024" model. This tutorial shows how to do it from English to German. I tried following the tutorial but it doesn't detail how to…
5 votes · 3 answers

Hugging-Face Transformers: Loading model from path error

I am pretty new to Hugging-Face transformers. I am facing the following issue when I try to load the xlm-roberta-base model from a given path: >> tokenizer = AutoTokenizer.from_pretrained(model_path) >> Traceback (most recent call last): File…
Spartan · 51
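A common cause of this traceback is pointing `from_pretrained` at a directory that is missing the config or tokenizer files. A hypothetical pre-flight helper (`check_model_dir` and its expected-file list are assumptions based on what `save_pretrained()` typically writes for xlm-roberta-base; adjust for your checkpoint):

```python
import os
import pathlib
import tempfile

def check_model_dir(model_path):
    """Report files missing from a local checkpoint directory before
    calling AutoTokenizer.from_pretrained(model_path).

    The expected list below is an assumption: a config plus the
    SentencePiece model that xlm-roberta-base tokenizers ship with.
    """
    expected = ["config.json", "sentencepiece.bpe.model"]
    return [f for f in expected
            if not os.path.exists(os.path.join(model_path, f))]

# Usage: a directory holding only a config fails the tokenizer check.
with tempfile.TemporaryDirectory() as d:
    pathlib.Path(d, "config.json").touch()
    print(check_model_dir(d))  # → ['sentencepiece.bpe.model']
```

If files are missing, re-exporting with `tokenizer.save_pretrained(model_path)` alongside `model.save_pretrained(model_path)` usually resolves it.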
5 votes · 3 answers

Huggingface Summarization

I am practicing with Transformers to summarize text. Following the tutorial at : https://huggingface.co/transformers/usage.html#summarization from transformers import pipeline summarizer = pipeline("summarization") ARTICLE = """ New York (CNN)When…
xamlova · 51
4 votes · 1 answer

How to fine tune a Huggingface Seq2Seq model with a dataset from the hub?

I want to train the "flax-community/t5-large-wikisplit" model with the "dxiao/requirements-ner-id" dataset. (Just for some experiments) I think my general procedure is not correct, but I don't know how to go further. My Code: Load tokenizer and…
4 votes · 2 answers

How to handle sequences longer than 512 tokens in layoutLMV3?

How can I work with sequences longer than 512 tokens? I don't want to use truncation=True; I actually want to handle the longer sequences.
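The standard alternative to truncation is a sliding window: split the sequence into overlapping chunks, run the model on each, and merge predictions afterwards (fast tokenizers can produce these chunks for you via `return_overflowing_tokens=True` with a `stride`). A minimal sketch of the windowing itself:

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Each window holds at most max_len tokens; consecutive windows overlap
    by `stride` tokens so an entity near a boundary appears whole in at
    least one window. 512/128 are illustrative defaults.
    """
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows

chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print([len(c) for c in chunks])  # → [512, 512, 232]
```

For token classification on the chunks, predictions in overlapped regions are typically resolved by keeping the window where the token sits farther from the edge.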
4 votes · 1 answer

HuggingFace Tokenizer.from_file(): Exception: data did not match any variant of untagged enum ModelWrapper

I am having an issue loading a BPE tokenizer with Tokenizer.from_file(). When I try, I encounter this error, where line 11743 is the last one: Exception: data did not match any variant of untagged enum ModelWrapper at line 11743 column 3 I have…
4 votes · 1 answer

How to drop sentences that are too long in Huggingface?

I'm going through the Huggingface tutorial, and it appears that the library has automatic truncation to cut sentences that are too long, based on a max value or other criteria. How can I instead drop sentences entirely for the same reason (sentences are too long,…
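With a `datasets.Dataset` this is what `dataset.filter(...)` is for: keep only examples whose tokenized length fits the limit. The predicate itself, sketched dependency-free (`count_tokens` stands in for e.g. `len(tokenizer(text)["input_ids"])`):

```python
def drop_too_long(examples, count_tokens, max_tokens=512):
    """Keep only examples whose token count fits the model's limit.

    count_tokens is a stand-in for a real tokenizer length check; with a
    datasets.Dataset the same predicate would go into dataset.filter().
    """
    return [ex for ex in examples if count_tokens(ex) <= max_tokens]

texts = ["short one", "word " * 600]
kept = drop_too_long(texts, count_tokens=lambda t: len(t.split()), max_tokens=512)
print(kept)  # → ['short one']
```

Note the difference from truncation: filtering discards the whole example, so the dataset shrinks rather than losing tail tokens.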
4 votes · 3 answers

How to apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?

In the HuggingFace tokenizer, applying the max_length argument specifies the length of the tokenized text. I believe it truncates the sequence to max_length-2 (if truncation=True) by cutting the excess tokens from the right. For the purposes of…
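Recent transformers versions expose this directly: set `tokenizer.truncation_side = "left"` before encoding. The effect, sketched in pure Python (the 101/102 ids match bert-base-cased's [CLS]/[SEP] but are only illustrative defaults here):

```python
def truncate_left(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep the *last* tokens instead of the first.

    Mirrors tokenizer.truncation_side = "left": two slots are reserved
    for the [CLS]/[SEP] special tokens and the excess is cut from the
    start of the sequence rather than the end.
    """
    body = token_ids[-(max_length - 2):]  # drop overflow from the left
    return [cls_id] + body + [sep_id]

print(truncate_left(list(range(1, 11)), max_length=6))
# → [101, 7, 8, 9, 10, 102]
```

Left truncation is the natural choice when the signal sits at the end of the text, e.g. dialogue history where the latest turns matter most.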
4 votes · 2 answers

SpeechBrain: Cannot Load Pretrained Model from Local Path

I'm trying to load a pretrained SpeechBrain HuggingFace model from local files; I don't want it to call out to HuggingFace to download. However, unless I change the pretrained_path in hyperparams.yaml, it is still calling out to HuggingFace and…
4 votes · 0 answers

Internal RuntimeError when using a custom fine-tuned model

I tried to fine-tune this model I found on huggingface (https://github.com/flexudy-pipe/sentence-doctor) in order to make it perform better in French; however, I have a problem. I used the train_any_t5_task.py file the author gave…
4 votes · 1 answer

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation

I'm learning NLP following this sequence classification tutorial from HuggingFace https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews The original code runs without problem. But when I tried to load a…
Rafael · 1,761
4 votes · 1 answer

How to download hugging face sentiment-analysis pipeline to use it offline?

I'm unable to use the hugging face sentiment analysis pipeline without internet. How do I download that pipeline for offline use? The basic code for sentiment analysis using hugging face is from…