Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
8 votes, 6 answers

Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers

I'm trying to build the model illustrated in this picture: I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way: from transformers import AutoTokenizer, TFBertModel model_name =…
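A frequent cause of this class of error is feeding TFBertModel every key the tokenizer returns, including ones the model does not accept. A minimal pure-Python sketch of the filtering idea (the helper name and dict contents are illustrative, not HuggingFace API):

```python
# Sketch (illustrative, not the transformers API): keep only the keys
# that TFBertModel commonly accepts as inputs.
def to_model_inputs(encoding):
    """Drop tokenizer output keys the model does not take."""
    allowed = {"input_ids", "attention_mask", "token_type_ids"}
    return {k: v for k, v in encoding.items() if k in allowed}

enc = {
    "input_ids": [[101, 7592, 102]],
    "attention_mask": [[1, 1, 1]],
    "special_tokens_mask": [[1, 0, 1]],  # extra key the model would reject
}
print(sorted(to_model_inputs(enc)))  # ['attention_mask', 'input_ids']
```

In practice you would pass the filtered dict (or the tokenizer's output with `return_tensors="tf"` and only the expected keys) straight to the model call.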
8 votes, 2 answers

How to add new special token to the tokenizer?

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased). QUERY: I want to ask a question. ANSWER: Sure, ask away. QUERY: How is the weather today? ANSWER: It is…
sid8491
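The usual recipe here is `tokenizer.add_special_tokens(...)` (or `add_tokens`) followed by resizing the model's embeddings. A pure-Python sketch of what adding tokens does to the vocabulary (the helper is illustrative only; the real calls live on the HuggingFace tokenizer object):

```python
# Sketch (illustrative, not the transformers API): new tokens get ids
# appended after the existing vocabulary.
def add_new_tokens(vocab, new_tokens):
    """Append tokens not already in vocab; return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # ids continue after the old ones
            added += 1
    return added

vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2}
n_added = add_new_tokens(vocab, ["[QUERY]", "[ANSWER]", "hello"])
print(n_added, len(vocab))  # 2 5
```

With the real API, remember to call `model.resize_token_embeddings(len(tokenizer))` afterwards so the embedding matrix covers the new ids.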
8 votes, 1 answer

What is the difference between len(tokenizer) and tokenizer.vocab_size?

I'm trying to add a few new words to the vocabulary of a pretrained HuggingFace Transformers model. I did the following to change the vocabulary of the tokenizer and also increase the embedding size of the model: tokenizer.add_tokens(['word1',…
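The short version: `vocab_size` reports only the base (pretrained) vocabulary, while `len(tokenizer)` also counts tokens added afterwards. A sketch of that relationship (the class is a stand-in, not the real tokenizer):

```python
# Sketch (illustrative): len(tokenizer) == vocab_size + number of added tokens.
class TokenizerSketch:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size  # base vocabulary, frozen at pretraining
        self.added = {}               # tokens added later via add_tokens
    def add_tokens(self, tokens):
        for t in tokens:
            self.added.setdefault(t, self.vocab_size + len(self.added))
    def __len__(self):
        return self.vocab_size + len(self.added)

tok = TokenizerSketch(30522)          # bert-base-uncased's vocab size
tok.add_tokens(["word1", "word2"])
print(tok.vocab_size, len(tok))       # 30522 30524
```

This is why embedding resizing should use `len(tokenizer)`, not `tokenizer.vocab_size`.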
8 votes, 0 answers

How to use threads for huggingface transformers

I'm trying to run a Hugging Face model, more precisely "cardiffnlp/twitter-roberta-base-sentiment", on threads. But at the same time, I want just one single instance of it because it's really costly in terms of time. In other words, I have multiple CSV…
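One common pattern for "many threads, one expensive model" is a lock-guarded lazy singleton. A minimal sketch, with a placeholder object standing in for the pipeline load:

```python
import threading

# Sketch: share one costly model instance across worker threads.
# object() is a stand-in for the real pipeline(...) load (an assumption here).
_model = None
_lock = threading.Lock()

def get_model():
    global _model
    with _lock:                   # serialize initialization
        if _model is None:
            _model = object()     # load the model exactly once
        return _model

results = []
def worker():
    results.append(id(get_model()))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(set(results)))  # 1 -> every thread saw the same instance
```

Note that sharing one instance serializes whatever happens under the lock; inference itself can still be batched to regain throughput.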
8 votes, 1 answer

Do I need to pre-tokenize the text first before using HuggingFace's RobertaTokenizer? (Different understanding)

I feel confused when using the Roberta tokenizer in Huggingface. >>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base') >>> x = tokenizer.tokenize("The tiger is ___ (big) than the dog.") ['The', 'Ġtiger', 'Ġis', 'Ġ___', 'Ġ(', 'big', ')',…
Allan-J
7 votes, 5 answers

SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json

I am facing below issue while loading the pretrained BERT model from HuggingFace due to SSL certificate error. Error: SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url:…
7 votes, 1 answer

KeyError while fine-tuning T5 for summarization with HuggingFace

I am trying to fine-tune the T5 transformer for summarization, but I am receiving a key error message: KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers' The…
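That message means integer indexing on the tokenizer's output is only implemented by the "fast" (Rust-backed) tokenizers, so the usual fix is `AutoTokenizer.from_pretrained(..., use_fast=True)`. A sketch emulating the check (the class is illustrative, not the real BatchEncoding):

```python
# Sketch (illustrative): integer indexing works only when the encoding
# came from a fast (Rust-backed) tokenizer.
class BatchEncodingSketch:
    def __init__(self, is_fast):
        self.is_fast = is_fast
    def __getitem__(self, idx):
        if isinstance(idx, int) and not self.is_fast:
            raise KeyError("Indexing with integers ... is not available "
                           "when using Python based tokenizers")
        return "Encoding"

fast, slow = BatchEncodingSketch(True), BatchEncodingSketch(False)
print(fast[0])          # 'Encoding'
raised = False
try:
    slow[0]
except KeyError:
    raised = True
print(raised)           # True -> slow tokenizers reject integer indexing
```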
6 votes, 1 answer

How does padding in the huggingface tokenizer work?

I tried following tokenization example: tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True) sent = "I hate this. Not that.", _tokenized = tokenizer(sent, padding=True, max_length=20,…
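The usual surprise here is that `padding=True` pads to the longest sequence in the batch, not to `max_length`; for the latter you need `padding="max_length"`. A sketch of the two strategies (the helper is hypothetical, mirroring the tokenizer's behavior):

```python
# Sketch (illustrative) of the two padding strategies the tokenizer supports:
# "longest" pads to the longest sequence in the batch,
# "max_length" pads every sequence out to max_length.
def pad(batch, pad_id=0, strategy="longest", max_length=None):
    target = max(len(s) for s in batch) if strategy == "longest" else max_length
    return [s + [pad_id] * (target - len(s)) for s in batch]

batch = [[101, 2009, 102], [101, 102]]
print(pad(batch))                                      # pads to length 3
print(pad(batch, strategy="max_length", max_length=5)) # pads to length 5
```

Also note that `max_length` only truncates when `truncation=True` is passed as well.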
6 votes, 1 answer

transformers AutoTokenizer.tokenize introducing extra characters

I am using HuggingFace transformers AutoTokenizer to tokenize small segments of text. However this tokenization is splitting incorrectly in the middle of words and introducing # characters to the tokens. I have tried several different models with…
Ciaran
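Those `##` characters are not corruption: they are WordPiece's continuation marker for sub-word pieces of a single word. The real API offers `tokenizer.convert_tokens_to_string(...)` to undo it; a sketch of the merge (helper name is hypothetical):

```python
# Sketch: merge WordPiece sub-tokens back into words by gluing '##'
# continuation pieces onto the preceding token. Illustrative helper only.
def merge_wordpieces(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation of the previous word
        else:
            words.append(tok)
    return " ".join(words)

print(merge_wordpieces(["token", "##izer", "splits", "rare", "words"]))
# tokenizer splits rare words
```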
6 votes, 2 answers

BERT get sentence embedding

I am replicating code from this page. I have downloaded the BERT model to my local system and getting sentence embedding. I have around 500,000 sentences for which I need sentence embedding and it is taking a lot of time. Is there a way to expedite…
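For 500,000 sentences, the single biggest speedup is usually encoding in batches (and moving the model to a GPU) rather than one sentence at a time. A sketch of the batching helper (names are illustrative; the encode step is a placeholder):

```python
# Sketch: slice a large corpus into fixed-size batches so the model
# processes many sentences per forward pass. Batch size is an assumption.
def batched(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sentences = [f"sentence {i}" for i in range(10)]
print([len(b) for b in batched(sentences, 4)])  # [4, 4, 2]
```

Each batch would then go through `tokenizer(batch, padding=True, return_tensors=...)` and one model call, instead of 500,000 separate calls.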
6 votes, 2 answers

What's the meaning of "Using bos_token, but it is not set yet."?

When I run the demo.py from transformers import AutoTokenizer, AutoModel tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased") model = AutoModel.from_pretrained("distilbert-base-multilingual-cased", return_dict=True) #…
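The warning appears because BERT-style tokenizers use `[CLS]`/`[SEP]` rather than a beginning-of-sentence token, so `bos_token` is simply unset; reading it logs the warning and yields None. A sketch of that behavior (the class is illustrative, not the real tokenizer):

```python
# Sketch (illustrative): a BERT-style tokenizer has cls/sep tokens but no
# bos_token, so accessing bos_token warns and returns None.
class TokenizerSketch:
    cls_token = "[CLS]"
    sep_token = "[SEP]"

    def __getattr__(self, name):
        if name == "bos_token":
            print("Using bos_token, but it is not set yet.")
            return None
        raise AttributeError(name)

tok = TokenizerSketch()
print(tok.bos_token)  # prints the warning, then None
```

In other words the message is informational, not an error, unless your code actually needs a BOS token.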
6 votes, 2 answers

AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)

I am new to PyTorch and recently, I have been trying to work with Transformers. I am using pretrained tokenizers provided by HuggingFace. I am successful in downloading and running them. But if I try to save them and load again, then some error…
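The usual fix is to pass `from_pretrained` the *directory* you gave `save_pretrained` (with all its files intact), not a path to a single file. A filesystem-only sketch of that round trip (file name and helpers are illustrative, not the real format):

```python
import json
import pathlib
import tempfile

# Sketch (illustrative): save tokenizer state into a directory, then load
# it back by pointing at the same directory.
def save_pretrained(vocab, directory):
    (pathlib.Path(directory) / "vocab.json").write_text(json.dumps(vocab))

def from_pretrained(directory):
    return json.loads((pathlib.Path(directory) / "vocab.json").read_text())

with tempfile.TemporaryDirectory() as d:
    save_pretrained({"hello": 0, "world": 1}, d)
    reloaded = from_pretrained(d)
print(reloaded)  # {'hello': 0, 'world': 1}
```

The real tokenizer writes several files (vocab, merges, config, special-tokens map); deleting or renaming any of them is a common cause of the load error.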
6 votes, 1 answer

BertWordPieceTokenizer vs BertTokenizer from HuggingFace

I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer. BertWordPieceTokenizer (Rust based) from tokenizers import BertWordPieceTokenizer sequence = "Hello, y'all! How are you…
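Both classes implement the same underlying algorithm (WordPiece: greedy longest-match-first against the vocabulary, with `##` marking continuations); the main difference is the backing implementation (Rust `tokenizers` vs. the Python `transformers` class) and some defaults. A sketch of the shared algorithm, using the classic "unaffable" example:

```python
# Sketch of the WordPiece algorithm both classes implement: greedy
# longest-match-first, with '##' prefixing non-initial pieces.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                      # no piece matched at all
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

Given the same vocabulary and normalization settings, the two classes should therefore produce the same tokens; differing defaults (e.g. lowercasing) are the usual source of divergence.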
5 votes, 1 answer

Why does the huggingface T5 tokenizer ignore some of the whitespaces?

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace characters to the tokenizer, like line ending (\n) and tab (\t). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace. So, it tokenizes…
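A likely explanation (an assumption worth verifying against your tokenizer config): T5's SentencePiece preprocessing normalizes runs of whitespace down to a single space before tokenization, so the second `\n` or `\t` disappears unless it is registered as a special token with its own id. A sketch of that normalization step:

```python
import re

# Sketch of SentencePiece-style whitespace normalization: consecutive
# whitespace collapses to one space before tokenization.
def sp_normalize(text):
    return re.sub(r"\s+", " ", text)

print(repr(sp_normalize("a\t\tb")))   # 'a b'  -> the second tab is gone
print(repr(sp_normalize("a \n b")))   # 'a b'
```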
5 votes, 1 answer

How to convert tokenized words back to the original ones after inference?

I'm writing an inference script for an already trained NER model, but I have trouble converting encoded tokens (their ids) into the original words. # example input df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…
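With a fast tokenizer, the standard route back is the token-to-word alignment that `encoding.word_ids()` exposes. A sketch of grouping sub-token predictions by word id (helper and sample data are illustrative):

```python
# Sketch: rebuild original words from sub-tokens using a word_ids-style
# alignment (real fast tokenizers expose encoding.word_ids()).
def group_by_word(tokens, word_ids):
    words = {}
    for tok, wid in zip(tokens, word_ids):
        if wid is None:            # special tokens like [CLS]/[SEP]
            continue
        piece = tok[2:] if tok.startswith("##") else tok
        words.setdefault(wid, []).append(piece)
    return ["".join(parts) for _, parts in sorted(words.items())]

tokens   = ["[CLS]", "Ama", "##zon", "and", "Tes", "##la", "[SEP]"]
word_ids = [None, 0, 0, 1, 2, 2, None]
print(group_by_word(tokens, word_ids))  # ['Amazon', 'and', 'Tesla']
```

The same word-id grouping also lets you aggregate per-token NER labels into one label per original word.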