Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
Questions tagged [huggingface-tokenizers]
451 questions
2
votes
1 answer
pip on Docker image cannot find Rust - even though Rust is installed
I'm trying to install some Python packages, namely tokenizers from huggingface transformers, which apparently needs Rust. So I am installing Rust on my Docker build:
FROM nikolaik/python-nodejs
USER pn
WORKDIR /home/pn/app
COPY . /home/pn/app/
RUN…

lte__
- 7,175
- 25
- 74
- 131
2
votes
1 answer
405 : Client Error: Not Allowed for huggingface url
I am trying to follow the huggingface tutorial on finetuning models for summarization.
All I'm trying to do is load the t5 tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
And I get the following…

Kiera.K
- 317
- 1
- 13
2
votes
0 answers
How to get tokens to words in BERT tokenizer
I have a list of tokens. Using the Hugging Face BERT tokenizer, I can map them to their numerical IDs.
X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
tokens = tokenizer.convert_tokens_to_ids(X)
tokens: [101, 103, 2293, 2023, 102]
Is there any function…

kowser66
- 125
- 1
- 8
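A minimal sketch of the reverse mapping asked about above, going from IDs back to tokens. It uses a toy vocabulary built only from the five token/ID pairs quoted in the question, so `token_to_id`, `id_to_token`, and `ids_to_tokens` are illustrative names and not part of any library. With a real `transformers` tokenizer, the `convert_ids_to_tokens` method is the likely counterpart to `convert_tokens_to_ids`.

```python
# Toy vocabulary taken from the token/ID pairs quoted in the question.
# A real BERT vocabulary has tens of thousands of entries.
token_to_id = {"[CLS]": 101, "[MASK]": 103, "love": 2293, "this": 2023, "[SEP]": 102}

# Invert the mapping once so each lookup is O(1).
id_to_token = {idx: tok for tok, idx in token_to_id.items()}


def ids_to_tokens(ids, unk_token="[UNK]"):
    """Map each ID back to its token, falling back to unk_token for unknown IDs."""
    return [id_to_token.get(idx, unk_token) for idx in ids]


print(ids_to_tokens([101, 103, 2293, 2023, 102]))
```

The fallback to `[UNK]` is a design choice for this sketch; raising an error on an unknown ID would be equally reasonable.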
2
votes
2 answers
Getting an error even after using truncation for tokenizer while predicting (MLM) on bert using huggingface
I am using truncation=True in the tokenizer
self.tokenizer = AutoTokenizer.from_pretrained(bert_model_str, truncation=True)
self.pipeline = pipeline("fill-mask", model=self.model, tokenizer=self.tokenizer)
however I am still getting multiple…

Coddy
- 549
- 4
- 18
2
votes
0 answers
How can I combine a Huggingface tokenizer and a BERT-based model in onnx?
Problem description:
I have a model based on BERT, with a classifier layer on top. I want to export it to ONNX, but to avoid issues on the side of the 'user' of the ONNX model, I want to export the entire pipeline, including tokenization, as an ONNX…

Kroshtan
- 637
- 5
- 17
2
votes
1 answer
Hugging face - Efficient tokenization of unknown token in GPT2
I am trying to train a dialog system using GPT2. For tokenization, I am using the following configuration for adding the special tokens.
from transformers import (
AdamW,
AutoConfig,
AutoTokenizer,
PreTrainedModel,
…

Soumya Ranjan Sahoo
- 133
- 2
- 9
2
votes
1 answer
How to get a probability distribution over tokens in a huggingface model?
I'm following this tutorial on getting predictions over masked words. I'm using this one because it seems to work with several masked words simultaneously, while other approaches I tried could only take one masked word at a time.
The…

Penguin
- 1,923
- 3
- 21
- 51
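One common way to turn a model's raw scores at a masked position into a probability distribution is a softmax over the vocabulary dimension. The sketch below shows only that step in plain Python. The four-word vocabulary and the logit values are made up for illustration, and `softmax` and `top_k` are hypothetical helper names. It is not the method from the tutorial the question refers to.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, vocab, k=2):
    """Return the k most probable (token, probability) pairs."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits for a single masked position.
vocab = ["love", "hate", "this", "that"]
logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
print(top_k(probs, vocab))
```

With several masked positions, the same softmax would be applied independently to the logits at each one.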
2
votes
1 answer
Using huggingface library gives an error: KeyError: 'logits'
I'm new to the huggingface library and trying to run a model for masked language modeling (the "fill-mask" task):
from transformers import BertTokenizer, BertForMaskedLM
import torch
from transformers import pipeline, AutoTokenizer, AutoModel
# Initialize MLM…

Penguin
- 1,923
- 3
- 21
- 51
2
votes
1 answer
Mapping huggingface tokens to original input text
How can I map the tokens I get from huggingface DistilBertTokenizer to the positions of the input text?
e.g. I have a new GPU -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...]
I'm interested in this because suppose that I have some…

Hardian Lawi
- 588
- 5
- 22
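A rough sketch of the token-to-position mapping described above, written in plain Python so that it does not depend on a particular tokenizer. It assumes the tokenizer only lowercased the text and that each token, minus a leading `##`, appears verbatim and in order. Accent stripping or other normalization would break that assumption. `wordpiece_offsets` is a hypothetical helper name. Fast tokenizers in `transformers` can typically return these spans directly via `return_offsets_mapping=True`.

```python
def wordpiece_offsets(text, tokens):
    """Return (start, end) character spans for WordPiece-style tokens.

    Assumes lowercasing is the only normalization applied and that
    tokens appear in the text in order.
    """
    lowered = text.lower()
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        piece = token[2:] if token.startswith("##") else token
        start = lowered.find(piece, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(piece)
        spans.append((start, end))
        cursor = end
    return spans


spans = wordpiece_offsets("I have a new GPU", ["i", "have", "a", "new", "gp", "##u"])
print(spans)
```

The end index is exclusive, matching the `(0, 1), (2, 6)` spans given in the question.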
2
votes
0 answers
Adding 'decoder_start_token_id' with SimpleTransformers
I am training MBART for Seq2Seq with SimpleTransformers, but I get an error that I do not see with BART:
TypeError: shift_tokens_right() missing 1 required positional argument: 'decoder_start_token_id'
So far I've tried various combinations…

LeOverflow
- 301
- 1
- 2
- 16
2
votes
1 answer
Huggingface Tokenizer object is not callable
I am writing deep learning code that embeds text using BERT-based embeddings. I am seeing unexpected issues in code that was working fine before. Below is the snippet:
sentences = ["person in red riding a motorcycle", "lady cutting cheese with…

amitgh
- 61
- 1
- 6
2
votes
0 answers
Is there a tokenizer that can find sentence boundaries and apply BPE at the same time?
There seem to be lots and lots of libraries out there that can find sentence boundaries.
The reason I need to find these is to chunk up longer texts so I can send them to language models.
This means once I have my chunks made up of complete…

rudolfovic
- 3,163
- 2
- 14
- 38
2
votes
1 answer
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'to_tensor'
I'm fine-tuning a BERT model using the Hugging Face, Keras, and TensorFlow libraries.
Since yesterday I'm getting this error running my code in Google Colab. The odd thing is that the code used to run without any problem and suddenly started to throw this…

ipietri
- 21
- 1
- 3
2
votes
0 answers
How long does load_dataset take in huggingface?
I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None
# Initialize a…

Ahmad
- 8,811
- 11
- 76
- 141
2
votes
1 answer
Which loss function to use for training sparse multi-label text classification problem and class skewness/imbalance
I am training a sparse multi-label text classification model using Hugging Face models, as one part of a Smart Reply system. The task is described below:
I classify Customer Utterances as input to the model and classify to…

MAC
- 1,345
- 2
- 30
- 60