Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
1 vote · 1 answer

How to add new tokens to an existing Huggingface tokenizer?

How to add new tokens to an existing Huggingface AutoTokenizer? Canonically, there's this tutorial from Huggingface https://huggingface.co/learn/nlp-course/chapter6/2 but it ends on the note of "quirks when using existing tokenizers". And then it…
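The usual recipe is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. A minimal sketch of that flow, using a tiny from-scratch tokenizer so it runs offline (with a hub checkpoint you would call `AutoTokenizer.from_pretrained(...)` instead; `<domain_term>` is a made-up token):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Tiny from-scratch tokenizer so the sketch runs offline; with a hub
# checkpoint you would use AutoTokenizer.from_pretrained(...) instead.
backend = Tokenizer(WordLevel(unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
backend.train_from_iterator(["hello world"], WordLevelTrainer(special_tokens=["[UNK]"]))
tok = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

before = len(tok)
num_added = tok.add_tokens(["<domain_term>"])  # returns how many tokens were actually new
# The new token maps to a fresh id appended at the end of the vocabulary:
new_id = tok.convert_tokens_to_ids("<domain_term>")
# After extending the tokenizer, grow the model's embedding matrix to match:
# model.resize_token_embeddings(len(tok))
```

The resize step is the "quirk" that trips people up: without it, the model's embedding table is smaller than the tokenizer's vocabulary and the new ids index out of range.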
1 vote · 0 answers

Error when loading a BERT model using load_model after adding a new token to the tokenizer

I am getting this error when trying to load the saved .h5 model in tensorflow: load_model("path_to_model.h5", custom_objects={"TFBertModel": TFBertModel, "AdamWeightDecay": AdamWeightDecay}) ValueError: Cannot assign value to variable '…
1 vote · 0 answers

The tokenizer doesn't recognize the new special tokens

When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I want to tokenize Arabic text. from tokenizers import BertWordPieceTokenizer from transformers import…
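A common cause is that the markers are never registered as special tokens, so the pre-tokenizer splits `[SP]` into `[`, `SP`, `]`. A minimal sketch with the `tokenizers` library, using a one-line stand-in corpus (the real code would train on the full Arabic text):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tok = Tokenizer(WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
# Include the markers at training time AND register them as special tokens.
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[SP]", "[EMPTY]"])
tok.train_from_iterator(["مرحبا بالعالم"], trainer)  # stand-in Arabic corpus
tok.add_special_tokens(["[SP]", "[EMPTY]"])  # keep them whole during encoding
out = tok.encode("[SP] مرحبا [EMPTY]")
```

With the `add_special_tokens` call in place, `out.tokens` contains `[SP]` and `[EMPTY]` as single tokens instead of bracket fragments.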
FQ912
1 vote · 0 answers

DataFrame text tokenization with Hugging Face is not working

I have a DataFrame with text I want to tokenize using the Hugging Face library. When running the code, the "Tokenized Text" column returns empty. How can this be solved? The code is as follows: df = pd.read_csv('subject_messages.csv') import…
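An empty "Tokenized Text" column usually means the tokenizer call never returns anything for each row. A minimal working sketch, with an in-memory frame and column name `text` standing in for `subject_messages.csv` and its real column:

```python
import pandas as pd
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Stand-in frame; the real code would read subject_messages.csv and use
# its own text column name.
df = pd.DataFrame({"text": ["hello world", "hello there"]})

tok = Tokenizer(WordLevel(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(df["text"], WordLevelTrainer(special_tokens=["[UNK]"]))

# Encode row by row; encode() returns an Encoding whose .tokens attribute
# is the list of token strings a "Tokenized Text" column usually expects.
df["Tokenized Text"] = df["text"].apply(lambda s: tok.encode(s).tokens)
```

The key point is returning `tok.encode(s).tokens` (or `.ids`) from the `apply` callable; returning nothing leaves the column full of `None`.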
1 vote · 1 answer

GPU out of memory fine tune flan-ul2

OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.78 GiB total capacity; 14.99 GiB already allocated; 3.50 MiB free; 14.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting…
1 vote · 0 answers

Is it possible to use Tiktoken's cl100k_base tokenizer in HuggingFace's pipeline?

I can use Tiktoken's cl100k_base tokenizer to encode text data. import tiktoken enc = tiktoken.get_encoding("cl100k_base") ids = enc.encode_ordinary('hello world') print(ids) which outputs the token IDs: [15339, 1917] While in HuggingFace, I use…
Raptor
1 vote · 1 answer

Unable to import transformers.models.bert.modeling_tf_bert on macOS?

As the title is self-descriptive, I'm not able to import the BertTokenizer and TFBertModel classes from the transformers package through the following code: from transformers import BertTokenizer, TFBertModel tokenizer =…
1 vote · 1 answer

Avoiding Trimmed Summaries of a PEGASUS-Pubmed huggingface summarization model

I am new to Hugging Face. I am using the PEGASUS-Pubmed Hugging Face model to generate summaries of research papers. Following is the code for the same. The model gives a trimmed summary. Any way of avoiding the trimmed summaries and getting more…
1 vote · 1 answer

How to interpret the model_max_length attribute of the PreTrainedTokenizer object in Huggingface Transformers

I've been trying to check the maximum length allowed by emilyalsentzer/Bio_ClinicalBERT, and after these lines of code: model_name = "emilyalsentzer/Bio_ClinicalBERT" tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer I've obtained the…
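`model_max_length` is the cap the tokenizer applies when you pass `truncation=True`; if the checkpoint doesn't set it, it defaults to a very large sentinel integer. A minimal sketch using a from-scratch tokenizer (the 512 here mirrors BERT-style checkpoints and is an assumption, not a value read from the hub):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(WordLevel(unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
backend.train_from_iterator(["hello world"], WordLevelTrainer(special_tokens=["[UNK]"]))

# model_max_length caps output length when truncation=True; 512 is an
# assumed BERT-style value, not one read from a hub checkpoint.
tok = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", model_max_length=512
)
ids = tok("hello world " * 1000, truncation=True)["input_ids"]
```

An input that would produce 2000 tokens is cut to exactly `model_max_length` tokens.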
1 vote · 0 answers

Conflicting versions of adapter-transformers with versions of tokenizers

I am trying to install adapter-transformers==3.1.0, and it gives the following error. How can I find the compatible adapter-transformers version for tokenizers==0.9.2? ERROR: pip's dependency resolver does not currently take into account all the…
1 vote · 1 answer

How does one create a custom hugging face model that is compatible with the HF trainer?

I want to create a new Hugging Face (HF) architecture with some existing tokenizer (any excellent one is fine). Let's say a decoder, to make it concrete (but both would be better). How does one do this? I found this…
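For Trainer compatibility, the essentials are a `PretrainedConfig` subclass, a `PreTrainedModel` subclass, and a `forward()` that accepts the batch's keys and returns a dict (or `ModelOutput`) containing `"loss"` when `labels` are passed. A minimal sketch; the names `TinyConfig`/`TinyLM` and all sizes are illustrative:

```python
import torch
from transformers import PretrainedConfig, PreTrainedModel

class TinyConfig(PretrainedConfig):
    model_type = "tiny-lm"  # illustrative name, not a registered architecture

    def __init__(self, vocab_size=100, hidden_size=16, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class TinyLM(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:
            # Trainer reads the "loss" key (or the first element) for backprop.
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        return {"loss": loss, "logits": logits}

model = TinyLM(TinyConfig())
batch = torch.randint(0, 100, (2, 8))
out = model(input_ids=batch, labels=batch)
```

Because the model subclasses `PreTrainedModel`, it also inherits `save_pretrained`/`from_pretrained`, so checkpointing inside the Trainer works without extra code.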
1 vote · 1 answer

Train Tokenizer with HuggingFace dataset

I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train the tokenizer with the following code: from tokenizers import Tokenizer from tokenizers.models import…
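The documented pattern is `train_from_iterator` fed by a generator that yields batches of text. A sketch with a small in-memory corpus standing in for wiki_split (an assumption; with `datasets` you would yield batches from e.g. `dataset["train"]["complex_sentence"]` instead):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# In-memory corpus as a stand-in for wiki_split; with `datasets` you would
# yield slices of the text column instead of this list.
corpus = ["hello world", "tokenizers are fast", "hello tokenizers"]

def batch_iterator(rows, batch_size=2):
    # train_from_iterator accepts an iterator over batches (lists) of strings
    for i in range(0, len(rows), batch_size):
        yield rows[i : i + batch_size]

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
tok.train_from_iterator(batch_iterator(corpus), trainer)
out = tok.encode("hello world")
```

Batching matters mainly for memory: yielding slices of the dataset avoids materializing the whole corpus as one Python list.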
Raptor
1 vote · 3 answers

tokenizer.push_to_hub(repo_name) is not working

I'm trying to push my tokenizer to my Hugging Face repo... it consists of the model vocab.json (I'm making a speech recognition model). My code: vocab_dict["|"] = vocab_dict[" "] del vocab_dict[" "] vocab_dict["[UNK]"] =…
1 vote · 1 answer

Huggingface token classification pipeline giving different outputs than just calling model() directly

I am trying to mask named entities in text using a RoBERTa-based model. The suggested way to use the model is via a Huggingface pipeline, but I find it rather slow to use it that way. Using a pipeline on text data also prevents me from using…
1 vote · 0 answers

mT5 transformer, how to access encoder to compute cosine similarity

This is my method; my question is how to access the encoder by sending 2 sentences at a time, because I have a dataset that contains pairs of sentences, and I need to compute the similarity between each pair. Could anyone help? model =…
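The usual recipe is to run each sentence through `model.get_encoder()(...)`, mean-pool the `last_hidden_state` over the (non-padding) token axis, and compare the pooled vectors with cosine similarity. A sketch of just the pooling and similarity step, with random tensors standing in for the encoder outputs (an assumption, so the sketch runs without downloading mT5):

```python
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs: with transformers you would run
#   enc_out = model.get_encoder()(input_ids=..., attention_mask=...)
# and pool enc_out.last_hidden_state. Shapes below are illustrative.
hidden_a = torch.randn(1, 7, 512)   # (batch, seq_len, hidden_size)
hidden_b = torch.randn(1, 5, 512)
mask_a = torch.ones(1, 7)           # attention masks (1 = real token)
mask_b = torch.ones(1, 5)

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padding positions."""
    mask = mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

sim = F.cosine_similarity(mean_pool(hidden_a, mask_a), mean_pool(hidden_b, mask_b))
```

Masked mean-pooling (rather than pooling over padding too) is what makes the similarity comparable across sentences of different lengths.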