Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
Questions tagged [huggingface-tokenizers]
451 questions
8
votes
6 answers
Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers
I'm trying to build the model illustrated in this picture:
I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:
from transformers import AutoTokenizer, TFBertModel
model_name =…

Gerardo Zinno
- 1,518
- 1
- 13
- 35
8
votes
2 answers
How to add new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is…

sid8491
- 6,622
- 6
- 38
- 64
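A typical approach for the question above, sketched with bert-base-uncased; the `[QUERY]`/`[ANSWER]` marker tokens are hypothetical names for the conversation turns in the example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Register the markers as special tokens so the tokenizer never splits them.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[QUERY]", "[ANSWER]"]}
)

tokens = tokenizer.tokenize("[QUERY] How is the weather today?")

# The model's embedding matrix must then grow to cover the new token ids:
# model.resize_token_embeddings(len(tokenizer))
```

Without the `resize_token_embeddings` call, feeding the new ids to the model would index past the end of its embedding table.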
8
votes
1 answer
What is the difference between len(tokenizer) and tokenizer.vocab_size?
I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model:
tokenizer.add_tokens(['word1',…

mitra mirshafiee
- 393
- 6
- 17
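The distinction the question asks about can be shown directly; this sketch assumes bert-base-uncased, and `wordxyz1`/`wordxyz2` are hypothetical new words assumed absent from the original vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
base = tokenizer.vocab_size           # fixed size of the original vocabulary
tokenizer.add_tokens(["wordxyz1", "wordxyz2"])

# vocab_size ignores tokens added after loading; len(tokenizer) counts them,
# which is why model.resize_token_embeddings takes len(tokenizer).
print(tokenizer.vocab_size, len(tokenizer))
```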
8
votes
0 answers
How to use threads for huggingface transformers
I'm trying to run a Hugging Face model, more exactly "cardiffnlp/twitter-roberta-base-sentiment", on threads. But at the same time, I want just one single instance of it because it's really costly in terms of time.
In other words, I have multiple CSV…

Mircea
- 1,671
- 7
- 25
- 41
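One common pattern for the single-instance requirement, sketched here with a stand-in for the real pipeline (`load_model`, `MODEL`, and `classify` are hypothetical names; the lambda stands in for something like `pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")`):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Placeholder for the expensive model load; done exactly once.
    return lambda text: {"label": "POS" if "good" in text else "NEU"}

MODEL = load_model()            # one shared instance for all threads
_lock = threading.Lock()

def classify(text):
    # Not every model guarantees thread-safe concurrent forward passes,
    # so serialize access to the shared instance; drop the lock if yours does.
    with _lock:
        return MODEL(text)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(classify, ["good day", "bad day"] * 4))
```

Each thread (e.g. one per CSV file) calls `classify` and shares the single loaded model instead of loading its own copy.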
8
votes
1 answer
Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different understanding)
I feel confused when using the Roberta tokenizer in Huggingface.
>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')',…

Allan-J
- 336
- 4
- 11
7
votes
5 answers
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json
I am facing below issue while loading the pretrained BERT model from HuggingFace due to SSL certificate error.
Error:
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url:…

Nikita Malviya
- 181
- 1
- 2
- 7
7
votes
1 answer
KeyError while fine-tuning T5 for summarization with HuggingFace
I am trying to fine-tune the T5 transformer for summarization but I am receiving a key error message:
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'
The…

Johnpac
- 85
- 1
- 7
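The error message quoted above names its own cause: indexing a batch encoding by integer only works with a Rust-backed "fast" tokenizer. A sketch, assuming t5-small:

```python
from transformers import AutoTokenizer

# use_fast=True (the default in recent transformers versions) loads the
# Rust-backed tokenizer; the slow Python T5 tokenizer raises the KeyError.
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

enc = tokenizer(["summarize: first document", "summarize: second one"])
first = enc[0]   # per-example Encoding; unavailable on slow tokenizers
```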
6
votes
1 answer
How does padding in the huggingface tokenizer work?
I tried the following tokenization example:
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = "I hate this. Not that.",
_tokenized = tokenizer(sent, padding=True, max_length=20,…

MsA
- 2,599
- 3
- 22
- 47
6
votes
1 answer
transformers AutoTokenizer.tokenize introducing extra characters
I am using HuggingFace transformers AutoTokenizer to tokenize small segments of text. However, this tokenization is splitting incorrectly in the middle of words and introducing # characters into the tokens. I have tried several different models with…

Ciaran
- 451
- 1
- 4
- 14
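The `#` characters described above are most likely WordPiece continuation markers rather than corruption; a sketch, assuming bert-base-uncased:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Out-of-vocabulary words are split into subwords; '##' flags a piece
# that attaches to the previous token, not a new word.
tokens = tokenizer.tokenize("tokenization")
print(tokens)

# Rejoin the pieces with the tokenizer, not by hand:
text = tokenizer.convert_tokens_to_string(tokens)
```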
6
votes
2 answers
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings.
I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.
Is there a way to expedite…

user2543622
- 5,760
- 25
- 91
- 159
6
votes
2 answers
What's the meaning of "Using bos_token, but it is not set yet."?
When I run the demo.py
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", return_dict=True)
#…

young
- 61
- 1
- 4
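The message is informational, not an error; a short sketch showing where it comes from, assuming the same checkpoint as the question:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-multilingual-cased"
)

# BERT-family tokenizers use [CLS]/[SEP] rather than bos/eos tokens, so
# reading tokenizer.bos_token logs "Using bos_token, but it is not set
# yet." (in the transformers version the question uses) and returns None.
print(tokenizer.bos_token)   # None
print(tokenizer.cls_token)   # [CLS]
```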
6
votes
2 answers
AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)
I am new to PyTorch and have recently been trying to work with Transformers. I am using pretrained tokenizers provided by HuggingFace.
I am successful in downloading and running them. But if I try to save them and load them again, then some error…

ferty567
- 61
- 1
- 1
- 3
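A minimal round-trip sketch, assuming bert-base-uncased: `save_pretrained` writes everything `from_pretrained` needs to reload from a plain directory (tokenizer_config.json, the vocab files, special_tokens_map.json). If an older save is missing the config that tells AutoTokenizer which class to build, loading with the concrete class (e.g. `BertTokenizer.from_pretrained`) also works:

```python
import tempfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with tempfile.TemporaryDirectory() as save_dir:
    tokenizer.save_pretrained(save_dir)
    reloaded = AutoTokenizer.from_pretrained(save_dir)
    # The reloaded tokenizer should behave identically.
    same = (reloaded("hello world")["input_ids"]
            == tokenizer("hello world")["input_ids"])
```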
6
votes
1 answer
BertWordPieceTokenizer vs BertTokenizer from HuggingFace
I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.
BertWordPieceTokenizer (Rust based)
from tokenizers import BertWordPieceTokenizer
sequence = "Hello, y'all! How are you…

HopeKing
- 3,317
- 7
- 39
- 62
5
votes
1 answer
why does huggingface t5 tokenizer ignore some of the whitespaces?
I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So, it tokenizes…

Berkay Berabi
- 1,933
- 1
- 10
- 26
5
votes
1 answer
How to convert tokenized words back to the original ones after inference?
I'm writing an inference script for an already trained NER model, but I have trouble converting encoded tokens (their ids) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…

deonardo_licaprio
- 308
- 1
- 11
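One common way to map tokens back to the source text is the offset mapping that fast tokenizers provide; a sketch, assuming bert-base-uncased (the question does not say which model the NER system uses):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Amazon and Tesla are currently the best picks"

# Fast tokenizers can return the character span each token came from,
# which maps per-token predictions back to the original, cased text.
enc = tokenizer(text, return_offsets_mapping=True)

# (0, 0) spans belong to special tokens such as [CLS]/[SEP]; skip them.
spans = [text[s:e] for s, e in enc["offset_mapping"] if e > s]
```

Grouping consecutive spans that share a word id (`enc.word_ids()`) then reconstructs whole words even when they were split into subwords.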