Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
Questions tagged [huggingface-tokenizers]
451 questions
8
votes
6 answers
Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers
I'm trying to build the model illustrated in this picture:
I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:
from transformers import AutoTokenizer, TFBertModel
model_name =…

Gerardo Zinno
- 1,518
- 1
- 13
- 35
8
votes
2 answers
How to add new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is…

sid8491
- 6,622
- 6
- 38
- 64
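A typical approach for the question above, sketched with bert-base-uncased; the `[QUERY]`/`[ANSWER]` marker tokens are hypothetical names for the conversation turns in the example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Register the markers as special tokens so the tokenizer never splits them.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[QUERY]", "[ANSWER]"]}
)

tokens = tokenizer.tokenize("[QUERY] How is the weather today?")

# The model's embedding matrix must then grow to cover the new token ids:
# model.resize_token_embeddings(len(tokenizer))
```

Without the `resize_token_embeddings` call, feeding the new ids to the model would index past the end of its embedding table.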
8
votes
1 answer
What is the difference between len(tokenizer) and tokenizer.vocab_size?
I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model:
tokenizer.add_tokens(['word1',…

mitra mirshafiee
- 393
- 6
- 17
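The distinction the question asks about can be shown directly; this sketch assumes bert-base-uncased, and `wordxyz1`/`wordxyz2` are hypothetical new words assumed absent from the original vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
base = tokenizer.vocab_size           # fixed size of the original vocabulary
tokenizer.add_tokens(["wordxyz1", "wordxyz2"])

# vocab_size ignores tokens added after loading; len(tokenizer) counts them,
# which is why model.resize_token_embeddings takes len(tokenizer).
print(tokenizer.vocab_size, len(tokenizer))
```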
8
votes
0 answers
How to use threads for huggingface transformers
I'm trying to run a Hugging Face model, more exactly "cardiffnlp/twitter-roberta-base-sentiment", on threads. But at the same time, I want just one single instance of it because it's really costly in terms of time.
In other words, I have multiple CSV…

Mircea
- 1,671
- 7
- 25
- 41
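One common pattern for the single-instance requirement, sketched here with a stand-in for the real pipeline (`load_model`, `MODEL`, and `classify` are hypothetical names; the lambda stands in for something like `pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")`):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Placeholder for the expensive model load; done exactly once.
    return lambda text: {"label": "POS" if "good" in text else "NEU"}

MODEL = load_model()            # one shared instance for all threads
_lock = threading.Lock()

def classify(text):
    # Not every model guarantees thread-safe concurrent forward passes,
    # so serialize access to the shared instance; drop the lock if yours does.
    with _lock:
        return MODEL(text)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(classify, ["good day", "bad day"] * 4))
```

Each thread (e.g. one per CSV file) calls `classify` and shares the single loaded model instead of loading its own copy.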
8
votes
1 answer
Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different understanding)
I feel confused when using the Roberta tokenizer in Huggingface.
>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.")
['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')',…

Allan-J
- 336
- 4
- 11
7
votes
5 answers
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json
I am facing below issue while loading the pretrained BERT model from HuggingFace due to SSL certificate error.
Error:
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url:…

Nikita Malviya
- 181
- 1
- 2
- 7
7
votes
1 answer
KeyError while fine-tuning T5 for summarization with HuggingFace
I am trying to fine-tune the T5 transformer for summarization but I am receiving a key error message:
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'
The…

Johnpac
- 85
- 1
- 7
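The error message quoted above names its own cause: indexing a batch encoding by integer only works with a Rust-backed "fast" tokenizer. A sketch, assuming t5-small:

```python
from transformers import AutoTokenizer

# use_fast=True (the default in recent transformers versions) loads the
# Rust-backed tokenizer; the slow Python T5 tokenizer raises the KeyError.
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

enc = tokenizer(["summarize: first document", "summarize: second one"])
first = enc[0]   # per-example Encoding; unavailable on slow tokenizers
```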
6
votes
1 answer
How does padding in the huggingface tokenizer work?
I tried the following tokenization example:
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = "I hate this. Not that.",
_tokenized = tokenizer(sent, padding=True, max_length=20,…

MsA
- 2,599
- 3
- 22
- 47
6
votes
1 answer
transformers AutoTokenizer.tokenize introducing extra characters
I am using HuggingFace transformers AutoTokenizer to tokenize small segments of text. However, this tokenization is splitting incorrectly in the middle of words and introducing # characters into the tokens. I have tried several different models with…

Ciaran
- 451
- 1
- 4
- 14
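The `#` characters described above are most likely WordPiece continuation markers rather than corruption; a sketch, assuming bert-base-uncased:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Out-of-vocabulary words are split into subwords; '##' flags a piece
# that attaches to the previous token, not a new word.
tokens = tokenizer.tokenize("tokenization")
print(tokens)

# Rejoin the pieces with the tokenizer, not by hand:
text = tokenizer.convert_tokens_to_string(tokens)
```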
6
votes
2 answers
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings.
I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.
Is there a way to expedite…

user2543622
- 5,760
- 25
- 91
- 159
6
votes
2 answers
What's the meaning of "Using bos_token, but it is not set yet."?
When I run the demo.py
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", return_dict=True)
#…

young
- 61
- 1
- 4
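The message is informational, not an error; a short sketch showing where it comes from, assuming the same checkpoint as the question:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-multilingual-cased"
)

# BERT-family tokenizers use [CLS]/[SEP] rather than bos/eos tokens, so
# reading tokenizer.bos_token logs "Using bos_token, but it is not set
# yet." (in the transformers version the question uses) and returns None.
print(tokenizer.bos_token)   # None
print(tokenizer.cls_token)   # [CLS]
```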
6
votes
2 answers
AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)
I am new to PyTorch and have recently been trying to work with Transformers. I am using pretrained tokenizers provided by HuggingFace.
I am successful in downloading and running them. But if I try to save them and load them again, then some error…

ferty567
- 61
- 1
- 1
- 3
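A minimal round-trip sketch, assuming bert-base-uncased: `save_pretrained` writes everything `from_pretrained` needs to reload from a plain directory (tokenizer_config.json, the vocab files, special_tokens_map.json). If an older save is missing the config that tells AutoTokenizer which class to build, loading with the concrete class (e.g. `BertTokenizer.from_pretrained`) also works:

```python
import tempfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with tempfile.TemporaryDirectory() as save_dir:
    tokenizer.save_pretrained(save_dir)
    reloaded = AutoTokenizer.from_pretrained(save_dir)
    # The reloaded tokenizer should behave identically.
    same = (reloaded("hello world")["input_ids"]
            == tokenizer("hello world")["input_ids"])
```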
6
votes
1 answer
BertWordPieceTokenizer vs BertTokenizer from HuggingFace
I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.
BertWordPieceTokenizer (Rust based)
from tokenizers import BertWordPieceTokenizer
sequence = "Hello, y'all! How are you…

HopeKing
- 3,317
- 7
- 39
- 62
5
votes
1 answer
why does huggingface t5 tokenizer ignore some of the whitespaces?
I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So, it tokenizes…

Berkay Berabi
- 1,933
- 1
- 10
- 26
5
votes
1 answer
How to convert tokenized words back to the original ones after inference?
I'm writing an inference script for an already trained NER model, but I have trouble converting encoded tokens (their ids) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…

deonardo_licaprio
- 308
- 1
- 11
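One common way to map tokens back to the source text is the offset mapping that fast tokenizers provide; a sketch, assuming bert-base-uncased (the question does not say which model the NER system uses):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Amazon and Tesla are currently the best picks"

# Fast tokenizers can return the character span each token came from,
# which maps per-token predictions back to the original, cased text.
enc = tokenizer(text, return_offsets_mapping=True)

# (0, 0) spans belong to special tokens such as [CLS]/[SEP]; skip them.
spans = [text[s:e] for s, e in enc["offset_mapping"] if e > s]
```

Grouping consecutive spans that share a word id (`enc.word_ids()`) then reconstructs whole words even when they were split into subwords.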