Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
3 votes · 1 answer
Fast and slow tokenizers yield different results
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would…

Michael · 143 · 1 · 7
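
A minimal way to reproduce the comparison described above, assuming bert-base-uncased (the question does not name a model):

from transformers import pipeline

# Run the same fill-mask prompt through the fast (Rust) and slow (Python)
# tokenizer and compare the top predictions and their scores.
for use_fast in (True, False):
    fill = pipeline("fill-mask", model="bert-base-uncased", use_fast=use_fast)
    preds = fill("Paris is the [MASK] of France.")
    print("fast" if use_fast else "slow",
          [(p["token_str"], round(p["score"], 4)) for p in preds[:3]])
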
2 votes · 0 answers
SentencePiece tokenizer encodes to unknown token
I am using HuggingFace's implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode characters and then try to encode the string that…

Shital Shah · 63,284 · 17 · 238 · 185
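
A toy sketch of the behaviour (corpus and vocab size invented): characters missing from the training alphabet fall back to the unknown token, provided it was registered both at construction time and as a special token during training.

from tokenizers import SentencePieceBPETokenizer

corpus = ["hello world", "machine learning", "tokenizers are fast"]
tok = SentencePieceBPETokenizer(unk_token="<unk>")
tok.train_from_iterator(corpus, vocab_size=100, special_tokens=["<unk>"])

enc = tok.encode("hello Zürich")  # "Z" and "ü" never appeared in the corpus
print(enc.tokens)                 # unseen characters come back as <unk>
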
2 votes · 2 answers
How to do Tokenizer Batch processing? - HuggingFace
In the Tokenizer documentation from HuggingFace, the __call__ function accepts List[List[str]] and says:
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of…

Lucas Azevedo · 1,867 · 22 · 39
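
For reference, a short sketch of both call signatures (the model choice is an assumption):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A plain batch is a List[str]:
enc = tok(["first sentence", "a second, longer sentence"],
          padding=True, truncation=True, return_tensors="pt")
print(enc["input_ids"].shape)  # (2, longest-in-batch)

# List[List[str]] is for pre-tokenized input, one inner list of words per example:
enc2 = tok([["first", "sentence"], ["second", "one"]],
           is_split_into_words=True, padding=True)
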
2 votes · 1 answer
How to use HuggingFace Inference endpoints for both tokenization and inference?
I am trying to set up separate endpoints for tokenization and inference using HuggingFace models. Ideally I would like to use HuggingFace inference endpoints.
Is there a straightforward way to spin up endpoints for encoding, decoding, and inference…

Steven Krawczyk · 21 · 1
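
For context, an Inference Endpoint is just an HTTPS URL you POST JSON to; tokenization normally runs client-side or inside a custom handler rather than as a separate endpoint. A minimal client sketch, with placeholder URL and token:

import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
headers = {"Authorization": "Bearer hf_xxx"}                        # placeholder token

resp = requests.post(ENDPOINT_URL, headers=headers,
                     json={"inputs": "text to run through the model"})
print(resp.json())
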
2 votes · 1 answer
max_seq_length for transformer (Sentence-BERT)
I'm using sentence-BERT from Huggingface in the following way:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model.encode(text)
When text is long and contains more…

BlackHawk · 719 · 1 · 6 · 18
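
Anything past max_seq_length is silently truncated, so one common workaround (a sketch, not the library's own API) is to embed overlapping or consecutive chunks and pool the results:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
model.max_seq_length = 512

def embed_long(text, words_per_chunk=200):
    # Split on words, embed each chunk, and mean-pool the chunk embeddings.
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)] or [text]
    return np.mean(model.encode(chunks), axis=0)
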
2 votes · 1 answer
Getting an exception when fine-tuning a model
I am trying to fine-tune a model. There is a dataset:
[
{
"sample": [
" Какие советы помогут вам составить успешный бизнес-план?",
"\n1. Изучите свой целевой рынок: поймите, кому вы продаете, насколько велика конкуренция и текущие…

Ubuty_programmist_7 · 308 · 7
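
A frequent cause of exceptions with data shaped like this is handing the nested prompt/response lists straight to the tokenizer; a sketch of flattening each record first (the model and max_length here are assumptions):

from transformers import AutoTokenizer

data = [{"sample": ["What tips will help you put together a successful business plan?",
                    "\n1. Study your target market: ..."]}]

tok = AutoTokenizer.from_pretrained("gpt2")  # model choice is an assumption
tok.pad_token = tok.eos_token                # GPT-2 ships without a pad token

texts = ["".join(rec["sample"]) for rec in data]
enc = tok(texts, padding=True, truncation=True, max_length=512,
          return_tensors="pt")
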
2 votes · 1 answer
T5 fine-tuned model outputs <unk> instead of curly braces and other special characters
First off, I'll start by saying that I'm a beginner when it comes to machine learning as a whole and transformers, so my apologies if it's a dumb question.
I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with this…

zaki Miho · 51 · 5
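
T5's SentencePiece vocabulary lacks curly braces (they encode to the unknown token), so one common fix, sketched here against t5-base, is to add them and resize the embedding matrix before fine-tuning:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

tokenizer.add_tokens(["{", "}"])              # characters missing from the vocab
model.resize_token_embeddings(len(tokenizer)) # grow embeddings to match
print(tokenizer.tokenize("{'status': 'A'}"))  # braces now survive tokenization
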
2 votes · 0 answers
How to ensure last token in sequence is end-of-sequence token?
I am using the gpt2 model from huggingface's transformers library. When tokenizing, I would like all sequences to end in the end-of-sequence (EOS) token. How can I do this?
An easy solution is to manually append the EOS token to each sequence in a…

BioBroo · 613 · 1 · 7 · 21
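
The manual route the question mentions, as a minimal sketch:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

def encode_with_eos(texts):
    # GPT-2's tokenizer never appends EOS on its own, so append the string first.
    return tok([t + tok.eos_token for t in texts])

ids = encode_with_eos(["hello world"])["input_ids"][0]
assert ids[-1] == tok.eos_token_id  # every sequence now ends in EOS
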
2 votes · 0 answers
Transformers: train a new tokenizer based on an existing one
In the following code
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
tokenizer_new = tokenizer.train_new_from_iterator(training_corpus, 50000, new_special_tokens = ['健康','医学','试剂盒',....])
where…

Katelynn ruan · 332 · 4 · 13
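
train_new_from_iterator keeps the original pipeline (normalizer, pre-tokenizer, and so on) and only relearns the vocabulary; the first argument can be any iterator of text batches. A sketch with a hypothetical corpus file:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def training_corpus():
    # corpus.txt is a hypothetical file, one document per line
    with open("corpus.txt", encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == 1000:
                yield batch
                batch = []
        if batch:
            yield batch

tokenizer_new = tokenizer.train_new_from_iterator(training_corpus(), 50000)
tokenizer_new.save_pretrained("bert-chinese-retrained")  # hypothetical output dir
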
2 votes · 0 answers
How does Hugging Face's tokenizers library tokenize non-English characters?
I use tokenizers to tokenize natural language sentences into tokens.
But I came up with some questions.
Here are some examples I tried using tokenizers:
from transformers import GPT2TokenizerFast
tokenizer =…

dongrixinyu · 172 · 2 · 14
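
For what it's worth, GPT-2 uses byte-level BPE, so characters outside its learned merges (e.g. CJK) are split into the bytes of their UTF-8 encoding rather than becoming unknown tokens; a quick check:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "你好"                          # two characters, six UTF-8 bytes
ids = tok(text)["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # byte-level pieces, not <unk>
print(tok.decode(ids))                 # round-trips to the original string
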
2 votes · 0 answers
How to make BERT predict a new token
my problem looks like this:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
fill_mask_pipeline_pre = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sentence_test =…

Maximilian Huber · 21 · 1
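
fill-mask can only rank tokens that already exist in the vocabulary, so the usual recipe (sketched with a hypothetical new word) is to add the token, resize the embeddings, and fine-tune before expecting it among the predictions:

from transformers import BertTokenizer, BertForMaskedLM, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

tokenizer.add_tokens(["myneologism"])          # hypothetical new word
model.resize_token_embeddings(len(tokenizer))
# The new embedding row is randomly initialized, so fine-tune on text
# containing the word before expecting fill-mask to rank it highly:
fill_mask_pipeline_pre = pipeline("fill-mask", model=model, tokenizer=tokenizer)
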
2 votes · 1 answer
How to load a saved Huggingface T5 model where the tokenizer was extended in the training phase?
I use the following code to load the saved model:
config = T5Config.from_pretrained(
model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if…

Ahmad · 8,811 · 11 · 76 · 141
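
A sketch of the usual resolution, assuming the extended tokenizer was saved next to the checkpoint during training (the path is a placeholder):

from transformers import T5TokenizerFast, T5ForConditionalGeneration

# "my-t5-checkpoint" stands in for the training output directory, which must
# contain the *extended* tokenizer saved via save_pretrained()
tokenizer = T5TokenizerFast.from_pretrained("my-t5-checkpoint")
model = T5ForConditionalGeneration.from_pretrained("my-t5-checkpoint")

# sanity check that the embedding matrix matches the extended vocabulary
assert model.get_input_embeddings().num_embeddings == len(tokenizer)
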
2 votes · 1 answer
T5 model generates short output
I have fine-tuned the T5-base model (from Hugging Face) on a new task where each input and target are sentences of 256 words.
The loss converges to low values; however, when I use the generate method the output is always too short.
I tried giving…

Tamir · 1,224 · 1 · 5 · 18
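
generate() defaults to a short maximum length and stops at the first EOS, so long targets usually need explicit length controls; the values below are illustrative, not tuned:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("summarize: " + "long input text here", return_tensors="pt")
out = model.generate(**inputs,
                     max_new_tokens=400,  # the default cap is much shorter
                     min_length=100,      # discourage an early EOS
                     num_beams=4,
                     length_penalty=1.5)  # >1.0 favours longer beams
print(tokenizer.decode(out[0], skip_special_tokens=True))
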
2 votes · 0 answers
I need to make a pre-trained tokenizer (Hugging Face) safer for privacy
I am new to NLP and the Transformers library. Perhaps my doubt is naive, but I am not finding a good solution for it.
I have documents whose content is sensitive, and it is a requirement of mine not to publish it in clear text on the cloud. However my model is…

xxfeffo · 41 · 2
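
Worth noting: tokenizing is purely local and never uploads the documents themselves; to remove any network access at all, cache the tokenizer once and reload it offline (a sketch, model name assumed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # one-time download
tok.save_pretrained("./local_tokenizer")

# Later, on a machine with no network access:
tok = AutoTokenizer.from_pretrained("./local_tokenizer", local_files_only=True)
ids = tok("sensitive document text")["input_ids"]  # never leaves this machine
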
2 votes · 0 answers
Reducing Latency for GPT-J
I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer =…

BlackHawk · 719 · 1 · 6 · 18
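
One common latency lever on a 24 GB card is loading the half-precision weights; a sketch (the prompt and generation settings are illustrative):

import torch
import transformers

model = transformers.GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",          # half-precision branch of the weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,      # requires the accelerate package
).to("cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0]))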