Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
1 vote · 1 answer

How to add new tokens to an existing Huggingface tokenizer?

How to add new tokens to an existing Huggingface AutoTokenizer? Canonically, there's this tutorial from Huggingface https://huggingface.co/learn/nlp-course/chapter6/2 but it ends on the note of "quirks when using existing tokenizers". And then it…
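The usual recipe is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. A minimal sketch of that flow, using a tiny from-scratch tokenizer so it runs offline (with a hub checkpoint you would call `AutoTokenizer.from_pretrained(...)` instead; `<domain_term>` is a made-up token):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Tiny from-scratch tokenizer so the sketch runs offline; with a hub
# checkpoint you would use AutoTokenizer.from_pretrained(...) instead.
backend = Tokenizer(WordLevel(unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
backend.train_from_iterator(["hello world"], WordLevelTrainer(special_tokens=["[UNK]"]))
tok = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

before = len(tok)
num_added = tok.add_tokens(["<domain_term>"])  # returns how many tokens were actually new
# The new token maps to a fresh id appended at the end of the vocabulary:
new_id = tok.convert_tokens_to_ids("<domain_term>")
# After extending the tokenizer, grow the model's embedding matrix to match:
# model.resize_token_embeddings(len(tok))
```

The resize step is the "quirk" that trips people up: without it, the model's embedding table is smaller than the tokenizer's vocabulary and the new ids index out of range.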
1 vote · 0 answers

Error when loading a BERT model using load_model after adding a new token to the tokenizer

I am getting this error when trying to load the saved .h5 model in tensorflow: load_model("path_to_model.h5", custom_objects={"TFBertModel": TFBertModel, "AdamWeightDecay": AdamWeightDecay}) ValueError: Cannot assign value to variable '…
1 vote · 0 answers

The tokenizer doesn't recognize the new special tokens

When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I want to tokenize Arabic text. from tokenizers import BertWordPieceTokenizer from transformers import…
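A common cause is that the markers are never registered as special tokens, so the pre-tokenizer splits `[SP]` into `[`, `SP`, `]`. A minimal sketch with the `tokenizers` library, using a one-line stand-in corpus (the real code would train on the full Arabic text):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tok = Tokenizer(WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
# Include the markers at training time AND register them as special tokens.
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[SP]", "[EMPTY]"])
tok.train_from_iterator(["مرحبا بالعالم"], trainer)  # stand-in Arabic corpus
tok.add_special_tokens(["[SP]", "[EMPTY]"])  # keep them whole during encoding
out = tok.encode("[SP] مرحبا [EMPTY]")
```

With the `add_special_tokens` call in place, `out.tokens` contains `[SP]` and `[EMPTY]` as single tokens instead of bracket fragments.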
FQ912
1 vote · 0 answers

DataFrame text tokenization with Hugging Face is not working

I have a DataFrame with text I want to tokenize using the Hugging Face library. When running the code, the "Tokenized Text" column returns empty. How can this be solved? The code is as follows: df = pd.read_csv('subject_messages.csv') import…
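An empty "Tokenized Text" column usually means the tokenizer call never returns anything for each row. A minimal working sketch, with an in-memory frame and column name `text` standing in for `subject_messages.csv` and its real column:

```python
import pandas as pd
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Stand-in frame; the real code would read subject_messages.csv and use
# its own text column name.
df = pd.DataFrame({"text": ["hello world", "hello there"]})

tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(df["text"], WordLevelTrainer(special_tokens=["[UNK]"]))

# Encode row by row; encode() returns an Encoding whose .tokens attribute
# is the list of token strings a "Tokenized Text" column usually expects.
df["Tokenized Text"] = df["text"].apply(lambda s: tok.encode(s).tokens)
```

The key point is returning `tok.encode(s).tokens` (or `.ids`) from the `apply` callable; returning nothing leaves the column full of `None`.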
1 vote · 1 answer

GPU out of memory fine tune flan-ul2

OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.78 GiB total capacity; 14.99 GiB already allocated; 3.50 MiB free; 14.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting…
1 vote · 0 answers

Is it possible to use Tiktoken's cl100k_base tokenizer in HuggingFace's pipeline?

I can use Tiktoken's cl100k_base tokenizer to encode text data. import tiktoken enc = tiktoken.get_encoding("cl100k_base") ids = enc.encode_ordinary('hello world') print(ids) which outputs the token IDs: [15339, 1917] While in HuggingFace, I use…
Raptor
1 vote · 1 answer

Unable to import transformers.models.bert.modeling_tf_bert on macOS?

As the title is self-descriptive, I'm not able to import the BertTokenizer and TFBertModel classes from the transformers package through the following code: from transformers import BertTokenizer, TFBertModel tokenizer =…
1 vote · 1 answer

Avoiding Trimmed Summaries of a PEGASUS-Pubmed huggingface summarization model

I am new to Hugging Face. I am using the PEGASUS-Pubmed Hugging Face model to generate summaries of research papers. Following is the code for the same. The model gives a trimmed summary. Any way of avoiding the trimmed summaries and getting more…
1 vote · 1 answer

How to interpret the model_max_length attribute of the PreTrainedTokenizer object in Huggingface Transformers

I've been trying to check the maximum length allowed by emilyalsentzer/Bio_ClinicalBERT, and after these lines of code: model_name = "emilyalsentzer/Bio_ClinicalBERT" tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer I've obtained the…
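`model_max_length` is the cap the tokenizer applies when you pass `truncation=True`; if the checkpoint doesn't set it, it defaults to a very large sentinel integer. A minimal sketch using a from-scratch tokenizer (the 512 here mirrors BERT-style checkpoints and is an assumption, not a value read from the hub):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(WordLevel(unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
backend.train_from_iterator(["hello world"], WordLevelTrainer(special_tokens=["[UNK]"]))

# model_max_length caps output length when truncation=True; 512 is an
# assumed BERT-style value, not one read from a hub checkpoint.
tok = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", model_max_length=512
)
ids = tok("hello world " * 1000, truncation=True)["input_ids"]
```

An input that would produce 2000 tokens is cut to exactly `model_max_length` tokens.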
1 vote · 0 answers

Conflicting versions of adapter-transformers with versions of tokenizers

I am trying to install adapter-transformers==3.1.0, and it gives the following error. How can I find the compatible adapter-transformers version for tokenizers==0.9.2? ERROR: pip's dependency resolver does not currently take into account all the…
1 vote · 1 answer

How does one create a custom hugging face model that is compatible with the HF trainer?

I want to create a new Hugging Face (HF) architecture with some existing tokenizer (any excellent one is fine). Let's say a decoder, to make it concrete (but both would be better). How does one do this? I found this…
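For Trainer compatibility, the essentials are a `PretrainedConfig` subclass, a `PreTrainedModel` subclass, and a `forward()` that accepts the batch's keys and returns a dict (or `ModelOutput`) containing `"loss"` when `labels` are passed. A minimal sketch; the names `TinyConfig`/`TinyLM` and all sizes are illustrative:

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel

class TinyConfig(PretrainedConfig):
    model_type = "tiny-lm"  # illustrative name, not a registered architecture

    def __init__(self, vocab_size=100, hidden_size=16, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class TinyLM(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            # Trainer reads the "loss" key (or the first element) for backprop.
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return {"loss": loss, "logits": logits}

model = TinyLM(TinyConfig())
batch = torch.randint(0, 100, (2, 8))
out = model(input_ids=batch, labels=batch)
```

Because the model subclasses `PreTrainedModel`, it also inherits `save_pretrained`/`from_pretrained`, so checkpointing inside the Trainer works without extra code.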
1 vote · 1 answer

Train Tokenizer with HuggingFace dataset

I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train the tokenizer with the following code: from tokenizers import Tokenizer from tokenizers.models import…
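The documented pattern is `train_from_iterator` fed by a generator that yields batches of text. A sketch with a small in-memory corpus standing in for wiki_split (an assumption; with `datasets` you would yield batches from e.g. `dataset["train"]["complex_sentence"]` instead):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# In-memory corpus as a stand-in for wiki_split; with `datasets` you would
# yield slices of the text column instead of this list.
corpus = ["hello world", "tokenizers are fast", "hello tokenizers"]

def batch_iterator(rows, batch_size=2):
    # train_from_iterator accepts an iterator over batches (lists) of strings
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
tok.train_from_iterator(batch_iterator(corpus), trainer)
out = tok.encode("hello world")
```

Batching matters mainly for memory: yielding slices of the dataset avoids materializing the whole corpus as one Python list.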
Raptor
1 vote · 3 answers

tokenizer.push_to_hub(repo_name) is not working

I'm trying to push my tokenizer to my Hugging Face repo... it consists of the model vocab.json (I'm making a speech recognition model). My code: vocab_dict["|"] = vocab_dict[" "] del vocab_dict[" "] vocab_dict["[UNK]"] =…
1 vote · 1 answer

Huggingface token classification pipeline giving different outputs than just calling model() directly

I am trying to mask named entities in text using a RoBERTa-based model. The suggested way to use the model is via a Huggingface pipeline, but I find it rather slow to use it that way. Using a pipeline on text data also prevents me from using…
1 vote · 0 answers

mT5 transformer, how to access encoder to compute cosine similarity

This is my method; my question is how to access the encoder by sending 2 sentences at a time, because I have a dataset that contains pairs of sentences, and I need to compute the similarity between each pair. Could anyone help? model =…
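The usual recipe is to run each sentence through `model.get_encoder()(...)`, mean-pool the `last_hidden_state` over the (non-padding) token axis, and compare the pooled vectors with cosine similarity. A sketch of just the pooling and similarity step, with random tensors standing in for the encoder outputs (an assumption, so the sketch runs without downloading mT5):

```python
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs: with transformers you would run
#   enc_out = model.get_encoder()(input_ids=..., attention_mask=...)
# and pool enc_out.last_hidden_state. Shapes below are illustrative.
hidden_a = torch.randn(1, 7, 512)   # (batch, seq_len, hidden_size)
hidden_b = torch.randn(1, 5, 512)
mask_a = torch.ones(1, 7)           # attention masks (1 = real token)
mask_b = torch.ones(1, 5)

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padding positions."""
    mask = mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

sim = F.cosine_similarity(mean_pool(hidden_a, mask_a), mean_pool(hidden_b, mask_b))
```

Masked mean-pooling (rather than pooling over padding too) is what makes the similarity comparable across sentences of different lengths.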