Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
3 votes · 1 answer
Fast and slow tokenizers yield different results
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, the probabilities assigned to the words that would…

Michael · 143 · 1 · 7
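
A minimal way to reproduce the comparison described above, assuming bert-base-uncased (the question does not name a model):

from transformers import pipeline

# Run the same fill-mask prompt through the fast (Rust) and slow (Python)
# tokenizer and compare the top predictions and their scores.
for use_fast in (True, False):
    fill = pipeline("fill-mask", model="bert-base-uncased", use_fast=use_fast)
    preds = fill("Paris is the [MASK] of France.")
    print("fast" if use_fast else "slow",
          [(p["token_str"], round(p["score"], 4)) for p in preds[:3]])
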
2 votes · 0 answers
SentencePiece tokenizer encodes to unknown token
I am using HuggingFace's implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset which has no unicode characters and then try to encode the string that…

Shital Shah · 63,284 · 17 · 238 · 185
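
A toy sketch of the behaviour (corpus and vocab size invented): characters missing from the training alphabet fall back to the unknown token, provided it was registered both at construction time and as a special token during training.

from tokenizers import SentencePieceBPETokenizer

corpus = ["hello world", "machine learning", "tokenizers are fast"]
tok = SentencePieceBPETokenizer(unk_token="<unk>")
tok.train_from_iterator(corpus, vocab_size=100, special_tokens=["<unk>"])

enc = tok.encode("hello Zürich")  # "Z" and "ü" never appeared in the corpus
print(enc.tokens)                 # unseen characters come back as <unk>
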
2 votes · 2 answers
How to do Tokenizer Batch processing? - HuggingFace
In the Tokenizer documentation from HuggingFace, the __call__ function accepts List[List[str]] and says:
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of…

Lucas Azevedo · 1,867 · 22 · 39
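
For reference, a short sketch of both call signatures (the model choice is an assumption):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A plain batch is a List[str]:
enc = tok(["first sentence", "a second, longer sentence"],
          padding=True, truncation=True, return_tensors="pt")
print(enc["input_ids"].shape)  # (2, longest-in-batch)

# List[List[str]] is for pre-tokenized input, one inner list of words per example:
enc2 = tok([["first", "sentence"], ["second", "one"]],
           is_split_into_words=True, padding=True)
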
2 votes · 1 answer
How to use HuggingFace Inference endpoints for both tokenization and inference?
I am trying to set up separate endpoints for tokenization and inference using HuggingFace models. Ideally I would like to use HuggingFace inference endpoints.
Is there a straightforward way to spin up endpoints for encoding, decoding, and inference…

Steven Krawczyk · 21 · 1
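
For context, an Inference Endpoint is just an HTTPS URL you POST JSON to; tokenization normally runs client-side or inside a custom handler rather than as a separate endpoint. A minimal client sketch, with placeholder URL and token:

import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
headers = {"Authorization": "Bearer hf_xxx"}                        # placeholder token

resp = requests.post(ENDPOINT_URL, headers=headers,
                     json={"inputs": "text to run through the model"})
print(resp.json())
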
2 votes · 1 answer
max_seq_length for transformer (Sentence-BERT)
I'm using sentence-BERT from Huggingface in the following way:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model.encode(text)
When text is long and contains more…

BlackHawk · 719 · 1 · 6 · 18
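
Anything past max_seq_length is silently truncated, so one common workaround (a sketch, not the library's own API) is to embed overlapping or consecutive chunks and pool the results:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
model.max_seq_length = 512

def embed_long(text, words_per_chunk=200):
    # Split on words, embed each chunk, and mean-pool the chunk embeddings.
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)] or [text]
    return np.mean(model.encode(chunks), axis=0)
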
2 votes · 1 answer
Getting an exception when fine-tuning a model
I am trying to fine-tune a model. There is a dataset:
[
{
"sample": [
" Какие советы помогут вам составить успешный бизнес-план?",
"\n1. Изучите свой целевой рынок: поймите, кому вы продаете, насколько велика конкуренция и текущие…

Ubuty_programmist_7 · 308 · 7
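
A frequent cause of exceptions with data shaped like this is handing the nested prompt/response lists straight to the tokenizer; a sketch of flattening each record first (the model and max_length here are assumptions):

from transformers import AutoTokenizer

data = [{"sample": ["What tips will help you put together a successful business plan?",
                    "\n1. Study your target market: ..."]}]

tok = AutoTokenizer.from_pretrained("gpt2")  # model choice is an assumption
tok.pad_token = tok.eos_token                # GPT-2 ships without a pad token

texts = ["".join(rec["sample"]) for rec in data]
enc = tok(texts, padding=True, truncation=True, max_length=512,
          return_tensors="pt")
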
2 votes · 1 answer
T5 fine-tuned model outputs <unk> instead of curly braces and other special characters
First off, I'll start by saying that I'm a beginner when it comes to machine learning as a whole and transformers, so my apologies if it's a dumb question.
I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with this…

zaki Miho · 51 · 5
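
T5's SentencePiece vocabulary lacks curly braces (they encode to the unknown token), so one common fix, sketched here against t5-base, is to add them and resize the embedding matrix before fine-tuning:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

tokenizer.add_tokens(["{", "}"])              # characters missing from the vocab
model.resize_token_embeddings(len(tokenizer)) # grow embeddings to match
print(tokenizer.tokenize("{'status': 'A'}"))  # braces now survive tokenization
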
2 votes · 0 answers
How to ensure last token in sequence is end-of-sequence token?
I am using the gpt2 model from huggingface's transformers library. When tokenizing, I would like all sequences to end in the end-of-sequence (EOS) token. How can I do this?
An easy solution is to manually append the EOS token to each sequence in a…

BioBroo · 613 · 1 · 7 · 21
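
The manual route the question mentions, as a minimal sketch:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

def encode_with_eos(texts):
    # GPT-2's tokenizer never appends EOS on its own, so append the string first.
    return tok([t + tok.eos_token for t in texts])

ids = encode_with_eos(["hello world"])["input_ids"][0]
assert ids[-1] == tok.eos_token_id  # every sequence now ends in EOS
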
2 votes · 0 answers
Transformers: train a new tokenizer based on an existing one
In the following code
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
tokenizer_new = tokenizer.train_new_from_iterator(training_corpus, 50000, new_special_tokens = ['健康','医学','试剂盒',....])
where…

Katelynn ruan · 332 · 4 · 13
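
train_new_from_iterator keeps the original pipeline (normalizer, pre-tokenizer, and so on) and only relearns the vocabulary; the first argument can be any iterator of text batches. A sketch with a hypothetical corpus file:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def training_corpus():
    # corpus.txt is a hypothetical file, one document per line
    with open("corpus.txt", encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == 1000:
                yield batch
                batch = []
        if batch:
            yield batch

tokenizer_new = tokenizer.train_new_from_iterator(training_corpus(), 50000)
tokenizer_new.save_pretrained("bert-chinese-retrained")  # hypothetical output dir
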
2 votes · 0 answers
How does Hugging Face's tokenizers library tokenize non-English characters?
I use tokenizers to tokenize natural language sentences into tokens.
But I came up with some questions.
Here are some examples I tried using tokenizers:
from transformers import GPT2TokenizerFast
tokenizer =…

dongrixinyu · 172 · 2 · 14
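
For what it's worth, GPT-2 uses byte-level BPE, so characters outside its learned merges (e.g. CJK) are split into the bytes of their UTF-8 encoding rather than becoming unknown tokens; a quick check:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "你好"                          # two characters, six UTF-8 bytes
ids = tok(text)["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # byte-level pieces, not <unk>
print(tok.decode(ids))                 # round-trips to the original string
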
2 votes · 0 answers
How to make BERT predict a new token
my problem looks like this:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
fill_mask_pipeline_pre = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sentence_test =…

Maximilian Huber · 21 · 1
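
fill-mask can only rank tokens that already exist in the vocabulary, so the usual recipe (sketched with a hypothetical new word) is to add the token, resize the embeddings, and fine-tune before expecting it among the predictions:

from transformers import BertTokenizer, BertForMaskedLM, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

tokenizer.add_tokens(["myneologism"])          # hypothetical new word
model.resize_token_embeddings(len(tokenizer))
# The new embedding row is randomly initialized, so fine-tune on text
# containing the word before expecting fill-mask to rank it highly:
fill_mask_pipeline_pre = pipeline("fill-mask", model=model, tokenizer=tokenizer)
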
2 votes · 1 answer
How to load a saved Huggingface T5 model where the tokenizer was extended in the training phase?
I use the following code to load the saved model:
config = T5Config.from_pretrained(
model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if…

Ahmad · 8,811 · 11 · 76 · 141
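
A sketch of the usual resolution, assuming the extended tokenizer was saved next to the checkpoint during training (the path is a placeholder):

from transformers import T5TokenizerFast, T5ForConditionalGeneration

# "my-t5-checkpoint" stands in for the training output directory, which must
# contain the *extended* tokenizer saved via save_pretrained()
tokenizer = T5TokenizerFast.from_pretrained("my-t5-checkpoint")
model = T5ForConditionalGeneration.from_pretrained("my-t5-checkpoint")

# sanity check that the embedding matrix matches the extended vocabulary
assert model.get_input_embeddings().num_embeddings == len(tokenizer)
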
2 votes · 1 answer
T5 model generates short output
I have fine-tuned the T5-base model (from Hugging Face) on a new task where each input and target are sentences of 256 words.
The loss converges to low values; however, when I use the generate method the output is always too short.
I tried giving…

Tamir · 1,224 · 1 · 5 · 18
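
generate() defaults to a short maximum length and stops at the first EOS, so long targets usually need explicit length controls; the values below are illustrative, not tuned:

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("summarize: " + "long input text here", return_tensors="pt")
out = model.generate(**inputs,
                     max_new_tokens=400,  # the default cap is much shorter
                     min_length=100,      # discourage an early EOS
                     num_beams=4,
                     length_penalty=1.5)  # >1.0 favours longer beams
print(tokenizer.decode(out[0], skip_special_tokens=True))
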
2 votes · 0 answers
I need to make a pre-trained tokenizer (Hugging Face) safer for privacy
I am new to NLP and the Transformers library. Perhaps my doubt is naive, but I am not finding a good solution for it.
I have documents whose content is sensitive, and it is a requirement of mine not to publish it in clear text on the cloud. However my model is…

xxfeffo · 41 · 2
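
Worth noting: tokenizing is purely local and never uploads the documents themselves; to remove any network access at all, cache the tokenizer once and reload it offline (a sketch, model name assumed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # one-time download
tok.save_pretrained("./local_tokenizer")

# Later, on a machine with no network access:
tok = AutoTokenizer.from_pretrained("./local_tokenizer", local_files_only=True)
ids = tok("sensitive document text")["input_ids"]  # never leaves this machine
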
2 votes · 0 answers
Reducing Latency for GPT-J
I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer =…

BlackHawk · 719 · 1 · 6 · 18
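
One common latency lever on a 24 GB card is loading the half-precision weights; a sketch (the prompt and generation settings are illustrative):

import torch
import transformers

model = transformers.GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",          # half-precision branch of the weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,      # requires the accelerate package
).to("cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0]))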