Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
0
votes
1 answer

How to get the corresponding character or string that has been labelled as the 'UNK' token in BERT?

Tokenizing a string returns a list of tokens consisting of separate words and special tokens. How can I find out which word or character was mapped to the 'UNK' token, if there is one?
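A minimal sketch of one way to recover it, assuming a fast tokenizer (fast tokenizers support offset mappings back into the original string); the example text is hypothetical:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "An example with \u2603"  # the snowman char is likely out-of-vocabulary

enc = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

# each offset pair points back into the original string
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    if token == tokenizer.unk_token:
        print("UNK came from:", repr(text[start:end]))
```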
0
votes
0 answers

git push error: fatal: unable to access ..... Port number ended with 'a'

I fine-tuned the T5 model and I want to upload it to my Hugging Face repository. I have my directory, where I save the tokenizer and model: tokenizer.save_pretrained('my-t5-qa-legal') trained_model.model.save_pretrained('my-t5-qa-legal') Here are files in…
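The error message suggests a malformed git remote URL (a stray character where the port number should end). A sketch of an alternative upload path that sidesteps hand-edited remotes, assuming the saved artifacts above and a logged-in Hub account:

```python
# A sketch, not the asker's code: push the saved model and tokenizer
# directly with push_to_hub instead of a manual `git push`.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("my-t5-qa-legal")
model = T5ForConditionalGeneration.from_pretrained("my-t5-qa-legal")

model.push_to_hub("my-t5-qa-legal")      # requires `huggingface-cli login`
tokenizer.push_to_hub("my-t5-qa-legal")
```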
0
votes
0 answers

BERT models: how robust are they to typos?

Let me introduce the context briefly: I'm fine-tuning a generic BERT model for the food and beverage domain. The final goal is a classification task. To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines…
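One quick way to probe this: WordPiece splits a misspelled word into different subwords than the correct spelling, so the model receives a different input sequence. A minimal sketch (the exact splits depend on the vocabulary):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("restaurant"))   # ['restaurant'] - in vocabulary
print(tokenizer.tokenize("restuarant"))   # e.g. ['rest', '##ua', '##rant']
```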
0
votes
0 answers

How to get the word of a embedding vector from the pretrained model of hugging face?

I use Hugging Face's pretrained model, BERT, to get a sentence pooling vector (that is, tokenize the sentence and average the embedding vectors of all tokens). My code is as follows. I want to get the word whose pooling vector…
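A pooled sentence vector generally has no exact matching word, but the nearest vocabulary token can be found by comparing against the model's input embedding matrix. A minimal sketch (the query vector here is a stand-in, not the asker's pooled vector):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

emb = model.get_input_embeddings().weight              # (vocab_size, hidden)
query = emb[tokenizer.convert_tokens_to_ids("king")]   # stand-in vector

# cosine similarity of the query against every row of the embedding matrix
sims = torch.nn.functional.cosine_similarity(query.unsqueeze(0), emb)
top = sims.topk(5).indices
print(tokenizer.convert_ids_to_tokens(top.tolist()))   # nearest tokens
```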
0
votes
1 answer

Is BertTokenizer similar to word embedding?

The idea of using BertTokenizer from huggingface really confuses me. When I use tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") tokenizer.encode_plus("Hello") Is the result somewhat similar to when I pass a one-hot vector…
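A minimal sketch that may help disentangle the two: the tokenizer only maps text to integer IDs, while the embedding lookup happens inside the model:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer.encode_plus("Hello", return_tensors="pt")
print(enc["input_ids"])    # integer token IDs only, no embeddings yet

model = BertModel.from_pretrained("bert-base-uncased")
vectors = model.get_input_embeddings()(enc["input_ids"])
print(vectors.shape)       # e.g. (1, 3, 768): the actual dense vectors
```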
0
votes
0 answers

How to force LineByLineTextDataset split text corpus by words rather than symbols

Based on the question in https://github.com/huggingface/tokenizers/issues/244, I'm trying to use a WordLevel tokenizer with the RoBERTa transformers model. My vocabulary contains numbers as strings and special tokens. I have some issue and…
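A minimal sketch of a word-splitting WordLevel setup with the tokenizers library; the fix usually reported for this is to set a Whitespace pre-tokenizer so the trainer sees words rather than raw symbols (the corpus file name is hypothetical):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()   # split on words, not characters

trainer = WordLevelTrainer(special_tokens=["[UNK]", "<s>", "</s>", "<pad>"])
tokenizer.train(["corpus.txt"], trainer)  # hypothetical corpus file
print(tokenizer.encode("12 345 6").tokens)
```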
0
votes
1 answer

Transformers: WordLevel tokenizer produces strange vocabulary

Training the WordLevel tokenizer, I get a strange vocabulary. Below is my code: data = [ "Beautiful is better than ugly." "Explicit is better than implicit." "Simple is better than complex." "Complex is better than complicated." …
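A likely cause, judging only from the excerpt: adjacent string literals without commas are concatenated by Python, so the list above collapses into a single long string and the trained vocabulary looks wrong. With explicit commas:

```python
# each sentence is a separate list element only if the commas are present
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
]
```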
0
votes
0 answers

RoBERTa classifier: cannot generate single prediction

I have successfully trained a text emotion classifier by fine-tuning a RoBERTa language model, mostly using a helpful notebook found online. Now I am trying to write a function to generate the prediction for a single sample (sentence), but can't seem to…
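A minimal single-sample inference sketch, assuming the fine-tuned checkpoint was saved to './roberta-emotion' (a hypothetical path):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./roberta-emotion")
model = AutoModelForSequenceClassification.from_pretrained("./roberta-emotion")
model.eval()

inputs = tokenizer("I am so happy today!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class:", logits.argmax(dim=-1).item())
```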
0
votes
1 answer

Extracting embedding values of NLP pretrained models from tokenized strings

I am using the huggingface pipeline to extract embeddings of words in a sentence. As far as I know, a sentence is first turned into a tokenized string. I think the length of the tokenized string might not be equal to the number of words in the…
0
votes
1 answer

Key error when feeding the training corpus to the train_new_from_iterator method

I am following this tutorial here: https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb So, using this code, I add my custom dataset: from datasets import load_dataset dataset = load_dataset('csv',…
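A minimal sketch of the pattern the notebook uses; the key error is often a wrong column name, so the 'text' column and file name below are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("csv", data_files="my_data.csv")  # hypothetical file

def batch_iterator(batch_size=1000):
    # yield plain strings: train_new_from_iterator expects raw text,
    # not dataset rows or dicts
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i : i + batch_size]["text"]

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32000
)
```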
0
votes
0 answers

Can k=len(vocab) be used with top_k when viewing predicted tokens?

When viewing the top predicted tokens in masked language modelling (MLM), is it possible to use top_k with k=len(vocab)? So far, I have used this following line of code: mask_filler("The capital of [MASK] is Paris", top_k=5) Would it be possible to…
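A minimal sketch: top_k is just an integer, so the tokenizer's vocabulary size should work as the upper bound (mask_filler is assumed to be a fill-mask pipeline, as in the question):

```python
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="bert-base-uncased")
k = mask_filler.tokenizer.vocab_size   # rank every vocabulary token

results = mask_filler("The capital of [MASK] is Paris", top_k=k)
print(len(results))                    # one entry per vocabulary token
```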
0
votes
1 answer

HuggingFace T5 transformer model - how to prep a custom dataset for fine-tuning?

I am trying to use the HuggingFace library to fine-tune the T5 transformer model using a custom dataset. HF provides an example of fine-tuning with custom data, but it is for the DistilBERT model, not the T5 model I want to use. From their example it…
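A minimal preprocessing sketch for T5's text-to-text setup; the 'source'/'target' column names are assumptions, and the text_target argument requires a reasonably recent transformers version:

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128,
                       truncation=True)
    model_inputs["labels"] = labels["input_ids"]  # decoder targets
    return model_inputs

# tokenized = dataset.map(preprocess, batched=True)
```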
0
votes
1 answer

Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class

I am training a model using HuggingFace Trainer class. The following code does a decent job: !pip install datasets !pip install transformers from datasets import load_dataset from transformers import AutoModelForSequenceClassification,…
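One common workaround, sketched under assumed dataset and column names: with remove_unused_columns=False the raw dataset columns are forwarded to the model, so dropping them explicitly during map() avoids unexpected keyword arguments:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb")   # stand-in dataset with a 'text' column

tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding=True),
    batched=True,
    remove_columns=["text"],     # drop the raw column explicitly
)
```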
0
votes
1 answer

BERT: Is it possible to filter the predicted tokens in masked language modelling?

I have trained a masked language model using my own dataset, which contains sentences with emojis (trained on 20,000 entries). Now, when I make predictions, I want emojis to be in the output; however, most of the predicted tokens are words, so I…
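One possible approach, not from the question itself: score the [MASK] position manually and keep only a whitelist of emoji token IDs (the checkpoint path and emoji list are hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("./my-emoji-mlm")   # hypothetical
model = AutoModelForMaskedLM.from_pretrained("./my-emoji-mlm")

allowed_ids = tokenizer.convert_tokens_to_ids(["😀", "😂", "🍕"])

inputs = tokenizer("I love this [MASK]", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

scores = logits[allowed_ids].softmax(dim=-1)   # renormalize over whitelist
for tok_id, score in zip(allowed_ids, scores.tolist()):
    print(tokenizer.convert_ids_to_tokens(tok_id), round(score, 3))
```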
0
votes
1 answer

BERT: AttributeError: 'RobertaForMaskedLM' object has no attribute 'bert'

I am trying to freeze some layers of my masked language model using the following code: for param in model.bert.parameters(): param.requires_grad = False However, when I execute the code above, I get this error: AttributeError:…
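A likely fix, judging from the error alone: RobertaForMaskedLM exposes its encoder as .roberta rather than .bert, so the freezing loop should target that attribute:

```python
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
for param in model.roberta.parameters():   # not model.bert
    param.requires_grad = False
```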