Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
3
votes
1 answer

Fast and slow tokenizers yield different results

Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer. Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would…
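A minimal way to reproduce this kind of comparison is to build two fill-mask pipelines that share the model but differ only in the tokenizer; the checkpoint name and masked sentence below are placeholders chosen for illustration:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "bert-base-uncased"  # placeholder checkpoint
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One pipeline with the fast (Rust) tokenizer, one with the slow (Python) tokenizer
fast_tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
slow_tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)

fill_fast = pipeline("fill-mask", model=model, tokenizer=fast_tok)
fill_slow = pipeline("fill-mask", model=model, tokenizer=slow_tok)

sentence = "The capital of France is [MASK]."
print(fill_fast(sentence)[:3])  # top predictions with the fast tokenizer
print(fill_slow(sentence)[:3])  # top predictions with the slow tokenizer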
2
votes
0 answers

SentencePiece tokenizer encodes to unknown token

I am using HuggingFace's implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no Unicode characters and then try to encode a string that…
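A small sketch of the setup being described, assuming a plain-ASCII training corpus; the detail that usually matters here is registering the unknown token both in the model and as a special token during training, so out-of-vocabulary characters map to it:

from tokenizers import SentencePieceBPETokenizer

corpus = ["this is a plain ascii sentence", "another line of training text"]  # toy data

tokenizer = SentencePieceBPETokenizer(unk_token="<unk>")
tokenizer.train_from_iterator(corpus, vocab_size=500, special_tokens=["<unk>"])

# Characters never seen during training are expected to come back as "<unk>"
print(tokenizer.encode("emoji 🤗 here").tokens)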
2
votes
2 answers

How to do Tokenizer Batch processing? - HuggingFace

In the Tokenizer documentation from huggingface, the call function accepts List[List[str]] and says: text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of…
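As a hedged sketch, batch processing usually just means passing a list of strings (or a list of string pairs) in one call and letting the tokenizer pad and truncate them together; the checkpoint and sentences below are illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

batch = ["first sentence to encode", "a second, slightly longer sentence to encode"]
encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")

print(encoded["input_ids"].shape)  # (batch_size, padded_sequence_length)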
2
votes
1 answer

How to use HuggingFace Inference endpoints for both tokenization and inference?

I am trying to set up separate endpoints for tokenization and inference using HuggingFace models. Ideally I would like to use HuggingFace inference endpoints. Is there a straightforward way to spin up endpoints for encoding, decoding, and inference…
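One hedged sketch of such a split: keep encoding and decoding local with the checkpoint's tokenizer and send only the inference request to a hosted endpoint. The endpoint URL, token, and checkpoint name are placeholders, not a confirmed Inference Endpoints recipe:

import requests
from transformers import AutoTokenizer

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

# Tokenization/decoding can stay local; only inference goes over the network.
ids = tokenizer("some input text")["input_ids"]
print(tokenizer.decode(ids))

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={"inputs": "some input text"},
)
print(response.json())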
2
votes
1 answer

max_seq_length for transformer (Sentence-BERT)

I'm using sentence-BERT from Huggingface in the following way: from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') model.max_seq_length = 512 model.encode(text) When text is long and contains more…
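Anything beyond max_seq_length is silently truncated, so long documents are often chunked and the chunk embeddings pooled. A minimal sketch of that idea, with a naive word-window size chosen only for illustration:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
model.max_seq_length = 512  # tokens past this limit are dropped

def encode_long(text, window=200):
    # naive word-window chunking; one embedding per chunk, then mean-pooled
    words = text.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)] or [""]
    return np.mean(model.encode(chunks), axis=0)

embedding = encode_long("very long document " * 500)
print(embedding.shape)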
2
votes
1 answer

Getting an exception with fine-tuning of model

I am trying to fine-tune a model. There is a dataset: [ { "sample": [ " Какие советы помогут вам составить успешный бизнес-план?", "\n1. Изучите свой целевой рынок: поймите, кому вы продаете, насколько велика конкуренция и текущие…
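A hedged sketch of loading a JSON file with that shape via the datasets library and flattening each "sample" pair into one training string before tokenization; the file path and checkpoint are placeholders:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("json", data_files="train.json", split="train")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
tokenizer.pad_token = tokenizer.eos_token

def to_text(example):
    # "sample" holds [question, answer]; join them into one training string
    return {"text": " ".join(example["sample"])}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(to_text).map(tokenize)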
2
votes
1 answer

T5 fine-tuned model outputs <unk> instead of curly braces and other special characters

First off, I'll start by saying that I'm a beginner when it comes to machine learning as a whole and transformers, so my apologies if it's a dumb question. I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with this…
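The usual explanation is that T5's SentencePiece vocabulary simply lacks characters such as curly braces, so they decode as the unknown token. A hedged sketch of one common workaround, adding the missing characters as tokens before fine-tuning:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Characters absent from T5's SentencePiece vocab come back as <unk>;
# registering them and resizing the embeddings makes them learnable.
tokenizer.add_tokens(["{", "}", "<", ">"])
model.resize_token_embeddings(len(tokenizer))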
2
votes
0 answers

How to ensure last token in sequence is end-of-sequence token?

I am using the gpt2 model from huggingface's transformers library. When tokenizing, I would like all sequences to end in the end-of-sequence (EOS) token. How can I do this? An easy solution is to manually append the EOS token to each sequence in a…
BioBroo
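A minimal sketch of the manual approach mentioned in the question above, appending the EOS string before tokenizing (note that truncation applied afterwards could still cut it off):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

texts = ["first example", "second example"]

# Append the EOS string to each raw sequence before tokenizing
encoded = tokenizer([t + tokenizer.eos_token for t in texts])

for ids in encoded["input_ids"]:
    assert ids[-1] == tokenizer.eos_token_id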
2
votes
0 answers

Transformers: train a new tokenizer based on an existing one

In the following code from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese") tokenizer_new = tokenizer.train_new_from_iterator(training_corpus, 50000, new_special_tokens = ['健康','医学','试剂盒',....]) where…
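For context, a hedged sketch of what such a training_corpus typically looks like (a generator yielding batches of raw strings) and how the result is saved; the corpus file and save directory are placeholders:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("text", data_files="corpus.txt", split="train")  # placeholder corpus

def training_corpus():
    # yield batches of raw strings, as expected by train_new_from_iterator
    for i in range(0, len(dataset), 1000):
        yield dataset[i:i + 1000]["text"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
tokenizer_new = tokenizer.train_new_from_iterator(
    training_corpus(), 50000, new_special_tokens=["健康", "医学", "试剂盒"]
)
tokenizer_new.save_pretrained("my-new-tokenizer")  # placeholder directory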
2
votes
0 answers

How does Hugging Face's tokenizers library tokenize non-English characters?

I use tokenizers to tokenize natural language sentences into tokens, but I came up with some questions. Here are some examples I tried using tokenizers: from transformers import GPT2TokenizerFast tokenizer =…
dongrixinyu
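A short sketch of the behaviour behind the question above: GPT-2 uses byte-level BPE, so non-ASCII characters are first broken into UTF-8 bytes, and a single Chinese character often maps to several byte-level tokens:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Byte-level BPE: each character is split into UTF-8 bytes before merging,
# so one Chinese character typically becomes one to three tokens.
print(tokenizer.tokenize("hello"))
print(tokenizer.tokenize("你好"))
print(tokenizer("你好")["input_ids"])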
2
votes
0 answers

How to make BERT predict a new token

my problem looks like this: tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForMaskedLM.from_pretrained('bert-base-uncased') fill_mask_pipeline_pre = pipeline("fill-mask", model=model, tokenizer=tokenizer) sentence_test =…
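A hedged sketch of the usual prerequisite: a brand-new word has to be added to the tokenizer and the embedding matrix resized (and normally fine-tuned) before a fill-mask pipeline can ever rank it; the new token below is a placeholder:

from transformers import BertTokenizer, BertForMaskedLM, pipeline

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Register the new word and grow the embedding matrix; the new row is
# randomly initialised, so the model still needs fine-tuning to predict it well.
tokenizer.add_tokens(["mynewword"])  # placeholder token
model.resize_token_embeddings(len(tokenizer))

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("I really like [MASK].")[:5])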
2
votes
1 answer

How to load a saved Hugging Face T5 model where the tokenizer was extended in the training phase?

I use the following code to load the saved model: config = T5Config.from_pretrained( model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, use_auth_token=True if…
Ahmad
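A minimal sketch, assuming the extended tokenizer was saved alongside the model: loading both from the same checkpoint directory keeps the vocabulary and embedding matrix in agreement, and resizing realigns them if they ever disagree. The path is a placeholder:

from transformers import T5Tokenizer, T5ForConditionalGeneration

save_dir = "path/to/saved/checkpoint"  # placeholder

# Load the *extended* tokenizer from the checkpoint directory, not from the hub,
# so its vocab size matches the embeddings the model was saved with.
tokenizer = T5Tokenizer.from_pretrained(save_dir)
model = T5ForConditionalGeneration.from_pretrained(save_dir)

model.resize_token_embeddings(len(tokenizer))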
2
votes
1 answer

T5 model generates short output

I have fine-tuned the T5-base model (from Hugging Face) on a new task where each input and target are sentences of 256 words. The loss is converging to low values however when I use the generate method the output is always too short. I tried giving…
Tamir
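Output length is largely controlled at generate() time, so a hedged sketch of the knobs usually tried for this (values and the stand-in t5-base checkpoint are illustrative, not the asker's fine-tuned model):

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")  # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("summarize: " + "some long input text " * 50, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=256,       # hard cap on generated length (tokens)
    min_length=64,        # forbid stopping too early
    num_beams=4,
    length_penalty=1.5,   # >1 nudges beam search toward longer outputs
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))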
2
votes
0 answers

I need to make a pre-trained tokenizer (Hugging Face) safer for privacy

I am new to NLP and the Transformers library. Perhaps my doubt is naive, but I am not finding a good solution for it. I have documents whose content is sensitive, and it is a requirement of mine not to publish it in the clear on the cloud. However my model is…
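One hedged, partial mitigation is to run the tokenizer fully offline so the sensitive text never leaves the machine; the local directory below is a placeholder, and this sketch says nothing about protecting the model inference itself:

import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"   # block any calls to the Hub

from transformers import AutoTokenizer

# local_files_only ensures nothing is fetched or sent remotely; the sensitive
# documents only ever exist locally as text and token IDs.
tokenizer = AutoTokenizer.from_pretrained("./local-tokenizer", local_files_only=True)
ids = tokenizer("a sensitive document goes here")["input_ids"]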
2
votes
0 answers

Reducing Latency for GPT-J

I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way: config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B") tokenizer =…
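A hedged sketch of the usual first step for latency on a 3090: load the half-precision weights directly onto the GPU instead of the default fp32 checkpoint (generation settings below are illustrative):

import torch
from transformers import GPTJForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",          # fp16 weights: roughly half the memory, faster on a 3090
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda")

inputs = tokenizer("GPT-J is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0]))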