Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
0 votes · 1 answer
How to get the corresponding character or string that has been labelled as 'UNK' token in BERT?
Tokenizing a string returns a list of tokens consisting of separate words and special tokens. How can I decode which word or character was mapped to the 'UNK' token, if there is one?

Sazzad · 19 · 7
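A minimal sketch of one common approach, assuming a standard BERT tokenizer (the model name and example sentence below are illustrative, not taken from the question): tokenize each whitespace-separated word on its own and report the ones that come back as the unknown token.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "an emoji 😀 becomes unknown"  # hypothetical input sentence
for word in text.split():
    pieces = tokenizer.tokenize(word)  # WordPiece tokens for this word
    if tokenizer.unk_token in pieces:
        print(f"{word!r} was mapped to {tokenizer.unk_token}")
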
0 votes · 0 answers
git push error:fatal: unable to access .....Port number ended with 'a'
I fine-tuned the T5 model and I want to upload it to my Hugging Face repository.
I have a directory where I save the tokenizer and model:
tokenizer.save_pretrained('my-t5-qa-legal')
trained_model.model.save_pretrained('my-t5-qa-legal')
Here are the files in…

Sht · 179 · 1 · 1 · 9
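As a hedged aside, if the manual git push keeps failing, recent transformers versions can also upload directly from Python with push_to_hub, assuming huggingface-cli login was run first; the sketch below reloads the locally saved directory from the question and pushes it.

from transformers import T5Tokenizer, T5ForConditionalGeneration

# reload what was saved locally, then push to the Hub under the same repo name
tokenizer = T5Tokenizer.from_pretrained('my-t5-qa-legal')
model = T5ForConditionalGeneration.from_pretrained('my-t5-qa-legal')
tokenizer.push_to_hub('my-t5-qa-legal')
model.push_to_hub('my-t5-qa-legal')
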
0 votes · 0 answers
BERT models: how robust are they to typos?
Let me introduce the context briefly: I'm fine-tuning a generic BERT model for the food and beverage domain. The final goal is a classification task.
To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines…

wtfzambo · 578 · 1 · 12 · 21
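One quick, hedged way to probe this is to compare how the tokenizer splits a typo versus the correct spelling; the exact sub-word split in the comments is illustrative and depends on the vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("beverage"))   # likely a single known token
print(tokenizer.tokenize("bevrage"))    # a typo usually falls apart into several '##' sub-word pieces
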
0 votes · 0 answers
How to get the word corresponding to an embedding vector from a pretrained Hugging Face model?
I use Hugging Face's pretrained BERT model to get a sentence meaning by pooling (that is, tokenize the sentence and take the average vector of all the token embeddings). My code is as follows. I want to get the word whose pooling vector…

Maxwell Albert · 112 · 7
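A minimal sketch of the idea, assuming mean pooling over the last hidden states followed by a nearest-neighbour search against BERT's input embedding matrix (model name and sentence are placeholders, not the question's code):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
pooled = hidden.mean(dim=1).squeeze(0)           # average of all token vectors

emb = model.get_input_embeddings().weight        # (vocab_size, 768) input embedding matrix
scores = torch.nn.functional.cosine_similarity(emb, pooled.unsqueeze(0))
print(tokenizer.convert_ids_to_tokens([scores.argmax().item()]))  # closest vocabulary word
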
0 votes · 1 answer
Is BertTokenizer similar to word embedding?
The idea of using BertTokenizer from huggingface really confuses me.
When I use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode_plus("Hello")
is the result somewhat similar to when I pass
a one-hot vector…

Quang Đại Nguyễn · 35 · 6
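For reference, a short sketch of what encode_plus actually returns: integer vocabulary indices plus masks, not embedding vectors (the ids in the comment are indicative only).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
out = tokenizer.encode_plus("Hello")
print(out["input_ids"])        # e.g. [101, 7592, 102] -> [CLS], "hello", [SEP]
print(out["attention_mask"])   # ids are indices into the vocabulary, not one-hot or dense vectors
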
0 votes · 0 answers
How to force LineByLineTextDataset to split a text corpus by words rather than symbols
Based on the https://github.com/huggingface/tokenizers/issues/244 question, I'm trying to use a WordLevel tokenizer with the RoBERTa transformers model. My vocabulary contains numbers as strings and special tokens. I have some issue and…

Roman Kazmin · 931 · 6 · 18
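A minimal sketch of a word-level setup, assuming the goal is to split on whitespace rather than on individual symbols (the numeric training data below is a placeholder):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()   # split into whole words, not characters
trainer = WordLevelTrainer(special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"])
tokenizer.train_from_iterator(["1 2 3 42 100"], trainer)   # hypothetical numeric corpus
print(tokenizer.encode("42 100").tokens)
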
0 votes · 1 answer
Transformers: WordLevel tokenizer produces strange vocabulary
When training the WordLevel tokenizer I get a strange vocabulary. Below is my code:
data = [
    "Beautiful is better than ugly."
    "Explicit is better than implicit."
    "Simple is better than complex."
    "Complex is better than complicated."
…

Roman Kazmin · 931 · 6 · 18
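One thing worth checking in the snippet above (an assumption about the cause, not a confirmed fix): without commas, Python concatenates adjacent string literals, so the trainer sees one long sentence instead of four. A comma-separated list behaves as expected:

data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
]
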
0 votes · 0 answers
RoBERTa classifier: cannot generate single prediction
I have successfully trained a text emotion classifier by fine-tuning a RoBERTa language model, mostly using a helpful notebook found online. Now I am trying to write a function to generate the prediction for a single sample (sentence), but can't seem to…

user14501128 · 21 · 3
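A hedged sketch of single-sample inference, assuming a sequence-classification checkpoint saved by the training notebook (the checkpoint path and example sentence are placeholders):

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("path/to/fine-tuned-checkpoint")
model.eval()

inputs = tokenizer("I am so happy today!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())   # index of the predicted emotion class
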
0 votes · 1 answer
Extracting embedding values of NLP pretrained models from tokenized strings
I am using the Hugging Face pipeline to extract embeddings of words in a sentence. As far as I know, a sentence is first turned into a tokenized string. I think the length of the tokenized string might not be equal to the number of words in the…

Kadaj13 · 1,423 · 3 · 17 · 41
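A small sketch of how the token count relates back to words, assuming a fast tokenizer (word_ids() is only available on the "fast" implementations; model name and sentence are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("huggingface tokenizers are handy")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # sub-word tokens, often more than the word count
print(enc.word_ids())                                     # maps each token position back to its word index
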
0 votes · 1 answer
Key error when feeding the training corpus to the train_new_from_iterator method
I am following this tutorial here: https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb
So, using this code, I add my custom dataset:
from datasets import load_dataset
dataset = load_dataset('csv',…
user16098918
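A hedged sketch of the batch iterator that train_new_from_iterator expects; the CSV path and the "text" column name are assumptions and have to match the actual dataset (a KeyError usually means the column is named differently):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("csv", data_files="my_data.csv")["train"]   # hypothetical file

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]   # raises KeyError if the column is not named "text"

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
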
0 votes · 0 answers
Can k=len(vocab) be used with top_k when viewing predicted tokens?
When viewing the top predicted tokens in masked language modelling (MLM), is it possible to use top_k with k=len(vocab)?
So far, I have used the following line of code:
mask_filler("The capital of [MASK] is Paris", top_k=5)
Would it be possible to…
user16098918
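A minimal sketch, assuming mask_filler is a fill-mask pipeline as in the question; in recent transformers versions top_k can simply be set to the tokenizer's vocabulary size:

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mask_filler = pipeline("fill-mask", model="bert-base-uncased")

preds = mask_filler("The capital of [MASK] is Paris", top_k=tokenizer.vocab_size)
print(len(preds))   # one candidate per vocabulary token
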
0 votes · 1 answer
HuggingFace T5 transformer model - how to prep a custom dataset for fine-tuning?
I am trying to use the HuggingFace library to fine-tune the T5 transformer model using a custom dataset. HF provides an example of fine-tuning with custom data, but it is for the DistilBERT model, not the T5 model I want to use. From their example it…

TheLongSentance · 1 · 4
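A hedged sketch of preparing one batch for T5; the task prefix and the example input/target strings below are assumptions, not the HF example's exact code:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

inputs = ["summarize: The quick brown fox jumped over the lazy dog."]
targets = ["A fox jumped over a dog."]

model_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
labels = tokenizer(targets, padding=True, truncation=True, return_tensors="pt")["input_ids"]
labels[labels == tokenizer.pad_token_id] = -100   # padded positions are ignored by the loss
model_inputs["labels"] = labels                   # T5 expects the target ids as "labels"
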
0 votes · 1 answer
Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class
I am training a model using the HuggingFace Trainer class. The following code does a decent job:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification,…

Hossein · 2,041 · 1 · 16 · 29
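For context, a hedged sketch of the relevant argument: with remove_unused_columns=False the Trainer forwards every dataset column to the model, so any column the model's forward() does not accept (for example the raw text column) has to be dropped manually beforehand.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                # hypothetical output directory
    remove_unused_columns=False,     # keep all dataset columns in the batches
)
# e.g. tokenized_dataset = tokenized_dataset.remove_columns(["text"])  # drop columns the model cannot accept
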
0 votes · 1 answer
BERT: Is it possible to filter the predicted tokens in masked language modelling?
I have trained a masked language model using my own dataset, which contains sentences with emojis (trained on 20,000 entries).
Now, when I make predictions, I want emojis to be in the output; however, most of the predicted tokens are words, so I…
user16098918
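A hedged sketch of one way to do this (not the question's code): keep only a whitelist of token ids, for example the emoji tokens, by masking out every other logit at the [MASK] position; the checkpoint path and emoji tokens are placeholders.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/my-emoji-mlm")    # hypothetical checkpoint
model = AutoModelForMaskedLM.from_pretrained("path/to/my-emoji-mlm")

allowed_ids = tokenizer.convert_tokens_to_ids(["😀", "😢"])           # assumed emoji tokens in the vocab

inputs = tokenizer("I feel [MASK] today", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

filtered = torch.full_like(logits, float("-inf"))
filtered[allowed_ids] = logits[allowed_ids]       # keep only whitelisted tokens
print(tokenizer.decode([filtered.argmax().item()]))
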
0 votes · 1 answer
BERT: AttributeError: 'RobertaForMaskedLM' object has no attribute 'bert'
I am trying to freeze some layers of my masked language model using the following code:
for param in model.bert.parameters():
    param.requires_grad = False
However, when I execute the code above, I get this error:
AttributeError:…
user16098918
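A minimal sketch of the likely fix (an assumption based on the class name in the error): RoBERTa models expose the encoder under model.roberta rather than model.bert, so the freezing loop becomes:

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
for param in model.roberta.parameters():
    param.requires_grad = False   # freeze the encoder; the LM head stays trainable
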