Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
0 votes · 1 answer
How to get the corresponding character or string that has been labelled as 'UNK' token in BERT?
Tokenizing a string returns a list of tokens consisting of separate words and special tokens. How can I decode which word or character was mapped to the 'UNK' token, if there is one?

Sazzad · 19 · 7
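A minimal sketch of one common approach, assuming a standard BERT tokenizer (the model name and example sentence below are illustrative, not taken from the question): tokenize each whitespace-separated word on its own and report the ones that come back as the unknown token.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "an emoji 😀 becomes unknown"  # hypothetical input sentence
for word in text.split():
    pieces = tokenizer.tokenize(word)  # WordPiece tokens for this word
    if tokenizer.unk_token in pieces:
        print(f"{word!r} was mapped to {tokenizer.unk_token}")
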
0 votes · 0 answers
git push error:fatal: unable to access .....Port number ended with 'a'
I fine-tuned the T5 model and I want to upload it to my Hugging Face repository.
I have a directory where I save the tokenizer and model:
tokenizer.save_pretrained('my-t5-qa-legal')
trained_model.model.save_pretrained('my-t5-qa-legal')
Here are the files in…

Sht · 179 · 1 · 1 · 9
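As a hedged aside, if the manual git push keeps failing, recent transformers versions can also upload directly from Python with push_to_hub, assuming huggingface-cli login was run first; the sketch below reloads the locally saved directory from the question and pushes it.

from transformers import T5Tokenizer, T5ForConditionalGeneration

# reload what was saved locally, then push to the Hub under the same repo name
tokenizer = T5Tokenizer.from_pretrained('my-t5-qa-legal')
model = T5ForConditionalGeneration.from_pretrained('my-t5-qa-legal')
tokenizer.push_to_hub('my-t5-qa-legal')
model.push_to_hub('my-t5-qa-legal')
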
0 votes · 0 answers
BERT models: how robust are they to typos?
Let me introduce the context briefly: I'm fine-tuning a generic BERT model for the food and beverage domain. The final goal is a classification task.
To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines…

wtfzambo · 578 · 1 · 12 · 21
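One quick, hedged way to probe this is to compare how the tokenizer splits a typo versus the correct spelling; the exact sub-word split in the comments is illustrative and depends on the vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("beverage"))   # likely a single known token
print(tokenizer.tokenize("bevrage"))    # a typo usually falls apart into several '##' sub-word pieces
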
0 votes · 0 answers
How to get the word corresponding to an embedding vector from a pretrained Hugging Face model?
I use Hugging Face's pretrained BERT model to get a sentence meaning by pooling (that is, tokenize the sentence and take the average vector of all the token embeddings). My code is as follows. I want to get the word whose pooling vector…

Maxwell Albert · 112 · 7
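A minimal sketch of the idea, assuming mean pooling over the last hidden states followed by a nearest-neighbour search against BERT's input embedding matrix (model name and sentence are placeholders, not the question's code):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
pooled = hidden.mean(dim=1).squeeze(0)           # average of all token vectors

emb = model.get_input_embeddings().weight        # (vocab_size, 768) input embedding matrix
scores = torch.nn.functional.cosine_similarity(emb, pooled.unsqueeze(0))
print(tokenizer.convert_ids_to_tokens([scores.argmax().item()]))  # closest vocabulary word
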
0 votes · 1 answer
Is BertTokenizer similar to word embedding?
The idea of using BertTokenizer from huggingface really confuses me.
When I use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode_plus("Hello")
is the result somewhat similar to when I pass
a one-hot vector…

Quang Đại Nguyễn · 35 · 6
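For reference, a short sketch of what encode_plus actually returns: integer vocabulary indices plus masks, not embedding vectors (the ids in the comment are indicative only).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
out = tokenizer.encode_plus("Hello")
print(out["input_ids"])        # e.g. [101, 7592, 102] -> [CLS], "hello", [SEP]
print(out["attention_mask"])   # ids are indices into the vocabulary, not one-hot or dense vectors
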
0 votes · 0 answers
How to force LineByLineTextDataset to split a text corpus by words rather than symbols
Based on the https://github.com/huggingface/tokenizers/issues/244 question, I'm trying to use a WordLevel tokenizer with the RoBERTa transformers model. My vocabulary contains numbers as strings and special tokens. I have some issue and…

Roman Kazmin · 931 · 6 · 18
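A minimal sketch of a word-level setup, assuming the goal is to split on whitespace rather than on individual symbols (the numeric training data below is a placeholder):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()   # split into whole words, not characters
trainer = WordLevelTrainer(special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"])
tokenizer.train_from_iterator(["1 2 3 42 100"], trainer)   # hypothetical numeric corpus
print(tokenizer.encode("42 100").tokens)
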
0 votes · 1 answer
Transformers: WordLevel tokenizer produces strange vocabulary
When training the WordLevel tokenizer I get a strange vocabulary. Below is my code:
data = [
    "Beautiful is better than ugly."
    "Explicit is better than implicit."
    "Simple is better than complex."
    "Complex is better than complicated."
…

Roman Kazmin · 931 · 6 · 18
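One thing worth checking in the snippet above (an assumption about the cause, not a confirmed fix): without commas, Python concatenates adjacent string literals, so the trainer sees one long sentence instead of four. A comma-separated list behaves as expected:

data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
]
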
0 votes · 0 answers
RoBERTa classifier: cannot generate single prediction
I have successfully trained a text emotion classifier by fine-tuning a RoBERTa language model, mostly using a helpful notebook found online. Now I am trying to write a function to generate the prediction for a single sample (sentence), but can't seem to…

user14501128 · 21 · 3
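A hedged sketch of single-sample inference, assuming a sequence-classification checkpoint saved by the training notebook (the checkpoint path and example sentence are placeholders):

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("path/to/fine-tuned-checkpoint")
model.eval()

inputs = tokenizer("I am so happy today!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())   # index of the predicted emotion class
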
0 votes · 1 answer
Extracting embedding values of NLP pretrained models from tokenized strings
I am using the Hugging Face pipeline to extract embeddings of words in a sentence. As far as I know, a sentence is first turned into a tokenized string. I think the length of the tokenized string might not be equal to the number of words in the…

Kadaj13 · 1,423 · 3 · 17 · 41
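A small sketch of how the token count relates back to words, assuming a fast tokenizer (word_ids() is only available on the "fast" implementations; model name and sentence are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("huggingface tokenizers are handy")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # sub-word tokens, often more than the word count
print(enc.word_ids())                                     # maps each token position back to its word index
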
0 votes · 1 answer
Key error when feeding the training corpus to the train_new_from_iterator method
I am following this tutorial here: https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb
So, using this code, I add my custom dataset:
from datasets import load_dataset
dataset = load_dataset('csv',…
user16098918
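A hedged sketch of the batch iterator that train_new_from_iterator expects; the CSV path and the "text" column name are assumptions and have to match the actual dataset (a KeyError usually means the column is named differently):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("csv", data_files="my_data.csv")["train"]   # hypothetical file

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]   # raises KeyError if the column is not named "text"

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
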
0 votes · 0 answers
Can k=len(vocab) be used with top_k when viewing predicted tokens?
When viewing the top predicted tokens in masked language modelling (MLM), is it possible to use top_k with k=len(vocab)?
So far, I have used the following line of code:
mask_filler("The capital of [MASK] is Paris", top_k=5)
Would it be possible to…
user16098918
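A minimal sketch, assuming mask_filler is a fill-mask pipeline as in the question; in recent transformers versions top_k can simply be set to the tokenizer's vocabulary size:

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mask_filler = pipeline("fill-mask", model="bert-base-uncased")

preds = mask_filler("The capital of [MASK] is Paris", top_k=tokenizer.vocab_size)
print(len(preds))   # one candidate per vocabulary token
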
0 votes · 1 answer
HuggingFace T5 transformer model - how to prep a custom dataset for fine-tuning?
I am trying to use the HuggingFace library to fine-tune the T5 transformer model using a custom dataset. HF provides an example of fine-tuning with custom data, but it is for the DistilBERT model, not the T5 model I want to use. From their example it…

TheLongSentance · 1 · 4
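A hedged sketch of preparing one batch for T5; the task prefix and the example input/target strings below are assumptions, not the HF example's exact code:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

inputs = ["summarize: The quick brown fox jumped over the lazy dog."]
targets = ["A fox jumped over a dog."]

model_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
labels = tokenizer(targets, padding=True, truncation=True, return_tensors="pt")["input_ids"]
labels[labels == tokenizer.pad_token_id] = -100   # padded positions are ignored by the loss
model_inputs["labels"] = labels                   # T5 expects the target ids as "labels"
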
0 votes · 1 answer
Setting `remove_unused_columns=False` causes error in HuggingFace Trainer class
I am training a model using the HuggingFace Trainer class. The following code does a decent job:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification,…

Hossein · 2,041 · 1 · 16 · 29
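For context, a hedged sketch of the relevant argument: with remove_unused_columns=False the Trainer forwards every dataset column to the model, so any column the model's forward() does not accept (for example the raw text column) has to be dropped manually beforehand.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                # hypothetical output directory
    remove_unused_columns=False,     # keep all dataset columns in the batches
)
# e.g. tokenized_dataset = tokenized_dataset.remove_columns(["text"])  # drop columns the model cannot accept
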
0 votes · 1 answer
BERT: Is it possible to filter the predicted tokens in masked language modelling?
I have trained a masked language model using my own dataset, which contains sentences with emojis (trained on 20,000 entries).
Now, when I make predictions, I want emojis to be in the output; however, most of the predicted tokens are words, so I…
user16098918
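A hedged sketch of one way to do this (not the question's code): keep only a whitelist of token ids, for example the emoji tokens, by masking out every other logit at the [MASK] position; the checkpoint path and emoji tokens are placeholders.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/my-emoji-mlm")    # hypothetical checkpoint
model = AutoModelForMaskedLM.from_pretrained("path/to/my-emoji-mlm")

allowed_ids = tokenizer.convert_tokens_to_ids(["😀", "😢"])           # assumed emoji tokens in the vocab

inputs = tokenizer("I feel [MASK] today", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

filtered = torch.full_like(logits, float("-inf"))
filtered[allowed_ids] = logits[allowed_ids]       # keep only whitelisted tokens
print(tokenizer.decode([filtered.argmax().item()]))
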
0 votes · 1 answer
BERT: AttributeError: 'RobertaForMaskedLM' object has no attribute 'bert'
I am trying to freeze some layers of my masked language model using the following code:
for param in model.bert.parameters():
    param.requires_grad = False
However, when I execute the code above, I get this error:
AttributeError:…
user16098918
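A minimal sketch of the likely fix (an assumption based on the class name in the error): RoBERTa models expose the encoder under model.roberta rather than model.bert, so the freezing loop becomes:

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
for param in model.roberta.parameters():
    param.requires_grad = False   # freeze the encoder; the LM head stays trainable
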