Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
2 votes · 1 answer

pip on Docker image cannot find Rust - even though Rust is installed

I'm trying to install some Python packages, namely tokenizers from huggingface transformers, which apparently needs Rust. So I am installing Rust on my Docker build: FROM nikolaik/python-nodejs USER pn WORKDIR /home/pn/app COPY . /home/pn/app/ RUN…
lte__ • 7,175 • 25 • 74 • 131
2 votes · 1 answer

405: Client Error: Not Allowed for huggingface URL

I am trying to follow the huggingface tutorial on finetuning models for summarization. All I'm trying to do is load the T5 tokenizer. from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("t5-small") And I get the following…
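
A minimal sketch, assuming the standard transformers API, of caching the tokenizer locally and re-loading it offline; this helps separate Hub/network problems from problems in the code itself (the cache path is an illustrative placeholder):

    from transformers import AutoTokenizer

    # First run: download the tokenizer files into a local cache directory.
    tokenizer = AutoTokenizer.from_pretrained("t5-small", cache_dir="./hf_cache")

    # Later runs: load only from the local cache, without contacting the Hub.
    tokenizer = AutoTokenizer.from_pretrained(
        "t5-small", cache_dir="./hf_cache", local_files_only=True
    )

    print(tokenizer("translate English to German: Hello world")["input_ids"])
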
2 votes · 0 answers

How to map token IDs back to words in the BERT tokenizer

I have a list; using the huggingface BERT tokenizer I can get its numerical representation. X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]'] tokens = tokenizer.convert_tokens_to_ids(X) tokens: [101, 103, 2293, 2023, 102] Is there any function…
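
For the reverse direction the tokenizer exposes convert_ids_to_tokens; a minimal sketch, assuming bert-base-uncased:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
    ids = tokenizer.convert_tokens_to_ids(X)       # [101, 103, 2293, 2023, 102]

    # Reverse direction: numerical IDs back to token strings.
    tokens = tokenizer.convert_ids_to_tokens(ids)  # ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
    print(tokens)
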
2 votes · 2 answers

Getting an error even after using truncation for the tokenizer while predicting (MLM) on BERT using huggingface

I am using truncation=True in the tokenizer: self.tokenizer = AutoTokenizer.from_pretrained(bert_model_str, truncation=True) self.pipeline = pipeline("fill-mask", model=self.model, tokenizer=self.tokenizer); however, I am still getting multiple…
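
Truncation normally takes effect when the text is encoded, not as a from_pretrained argument; a minimal sketch of encoding with call-time truncation, with bert-base-uncased standing in for bert_model_str:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "bert-base-uncased"  # stand-in for bert_model_str
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    text = "a very long input containing a [MASK] token ..."
    # Truncation and the length limit are applied when the text is encoded.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)
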
2 votes · 0 answers

How can I combine a Huggingface tokenizer and a BERT-based model in ONNX?

Problem description: I have a model based on BERT, with a classifier layer on top. I want to export it to ONNX, but to avoid issues on the side of the 'user' of the ONNX model, I want to export the entire pipeline, including tokenization, as an ONNX…
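
A plain ONNX graph does not natively express string tokenization, so the common pattern is to export only the model and keep the tokenizer outside the graph; a minimal sketch with torch.onnx.export (model name, file name and opset are illustrative assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "bert-base-uncased"  # placeholder for the fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # return_dict=False makes the model return a plain tuple, which exports cleanly.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
    model.eval()

    # Tokenization stays in Python; only the model goes into the ONNX graph.
    inputs = tokenizer("example input", return_tensors="pt")

    torch.onnx.export(
        model,
        (inputs["input_ids"], inputs["attention_mask"]),
        "bert_classifier.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "logits": {0: "batch"},
        },
        opset_version=14,
    )
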
2 votes · 1 answer

Hugging Face - Efficient tokenization of unknown tokens in GPT2

I am trying to train a dialog system using GPT2. For tokenization, I am using the following configuration for adding the special tokens. from transformers import ( AdamW, AutoConfig, AutoTokenizer, PreTrainedModel, …
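
A minimal sketch of registering special tokens so the BPE keeps them whole, then growing the embedding matrix to match; the token strings are illustrative, not the question's actual configuration:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Register dialog markers as special tokens so the BPE never splits them.
    num_added = tokenizer.add_special_tokens({
        "pad_token": "<pad>",
        "additional_special_tokens": ["<speaker1>", "<speaker2>"],
    })

    # The embedding matrix must grow to cover the newly added token IDs.
    model.resize_token_embeddings(len(tokenizer))

    print(num_added, tokenizer.tokenize("<speaker1> hello there <speaker2> hi"))
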
2 votes · 1 answer

How to get a probability distribution over tokens in a huggingface model?

I'm following this tutorial on getting predictions for masked words. The reason I'm using this one is that it seems to work with several masked words simultaneously, while other approaches I tried could only take one masked word at a time. The…
Penguin • 1,923 • 3 • 21 • 51
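
A minimal sketch of turning the logits at each [MASK] position into a probability distribution over the vocabulary, with bert-base-uncased as a stand-in for the tutorial's model:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    text = "The capital of France is [MASK] and it is a [MASK] city."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)

    # One softmax per masked position gives a distribution over the vocabulary.
    mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
    probs = logits[mask_positions].softmax(dim=-1)  # (num_masks, vocab_size)

    for dist in probs:
        top = dist.topk(5)
        print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
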
2 votes · 1 answer

Using the huggingface library gives an error: KeyError: 'logits'

I'm new to the huggingface library and am trying to run a model to do masked language modeling (the "fill-mask" task): from transformers import BertTokenizer, BertForMaskedLM import torch from transformers import pipeline, AutoTokenizer, AutoModel # Initialize MLM…
Penguin • 1,923 • 3 • 21 • 51
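
A frequent cause of KeyError: 'logits' in a fill-mask pipeline is loading a bare AutoModel, which returns only hidden states; a minimal sketch of the usual fix, using a model class that carries the LM head (offered as a likely cause, not a diagnosis of this exact code):

    from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # AutoModelForMaskedLM carries the LM head, so its output exposes `logits`;
    # a bare AutoModel only returns hidden states, which the pipeline cannot use.
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    print(unmasker("Paris is the [MASK] of France."))
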
2 votes · 1 answer

Mapping huggingface tokens to original input text

How can I map the tokens I get from huggingface DistilBertTokenizer to the positions of the input text? e.g. I have a new GPU -> ["i", "have", "a", "new", "gp", "##u"] -> [(0, 1), (2, 6), ...] I'm interested in this because suppose that I have some…
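
Fast tokenizers can return character offsets directly; a minimal sketch with DistilBertTokenizerFast (the fast variant is required for offset mapping):

    from transformers import DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

    text = "I have a new GPU"
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

    # Each token is paired with its (start, end) character span in the input text.
    for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
        print(token, (start, end), repr(text[start:end]))
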
2 votes · 0 answers

Adding 'decoder_start_token_id' with SimpleTransformers

Training MBART in Seq2Seq with SimpleTransformers but getting an error I am not seeing with BART: TypeError: shift_tokens_right() missing 1 required positional argument: 'decoder_start_token_id' So far I've tried various combinations…
2 votes · 1 answer

Huggingface Tokenizer object is not callable

I am writing deep learning code that embeds text into BERT-based embeddings. I am seeing unexpected issues in code that was working fine before. Below is the snippet: sentences = ["person in red riding a motorcycle", "lady cutting cheese with…
amitgh • 61 • 1 • 6
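
Tokenizer instances only became callable in transformers 3.0; a minimal sketch showing the callable API and the older batch_encode_plus entry point, assuming a BERT checkpoint:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    sentences = ["person in red riding a motorcycle", "another example sentence"]

    # transformers >= 3.0: tokenizer instances are callable.
    encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    print(encoded["input_ids"].shape)

    # Where the tokenizer is not callable (older installs), batch_encode_plus is
    # the equivalent entry point; argument names differ across old releases.
    encoded_legacy = tokenizer.batch_encode_plus(sentences, padding=True, return_tensors="pt")
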
2 votes · 0 answers

Is there a tokenizer that can find sentence boundaries and apply BPE at the same time?

There seem to be lots and lots of libraries out there that can find sentence boundaries. The reason I need to find these is to chunk up longer texts so I can send them to language models. This means once I have my chunks made up of complete…
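
One way to get both is to detect sentence boundaries with a separate library and let the Hugging Face tokenizer count BPE tokens while packing chunks; a minimal sketch using NLTK for the boundaries (NLTK and the token budget are assumptions, not something the question specifies):

    import nltk
    from transformers import AutoTokenizer

    nltk.download("punkt", quiet=True)
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def chunk_text(text, max_tokens=512):
        """Pack whole sentences into chunks that stay under a BPE-token budget."""
        chunks, current, current_len = [], [], 0
        for sentence in nltk.sent_tokenize(text):
            n_tokens = len(tokenizer.encode(sentence))
            if current and current_len + n_tokens > max_tokens:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.append(sentence)
            current_len += n_tokens
        if current:
            chunks.append(" ".join(current))
        return chunks

    print(chunk_text("First sentence. Second sentence. Third sentence.", max_tokens=8))
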
2 votes · 1 answer

AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'to_tensor'

I'm fine-tuning a BERT model using the Hugging Face, Keras, and TensorFlow libraries. Since yesterday I've been getting this error when running my code in Google Colab. The odd thing is that the code used to run without any problem and suddenly started to throw this…
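
to_tensor() is a tf.RaggedTensor method, not an EagerTensor one, and a padded tokenizer call already returns dense tensors; a minimal sketch of the dense path, offered as a guess at the mismatch rather than a confirmed diagnosis (model name is a placeholder):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    texts = ["first example sentence", "a second, slightly longer example sentence"]

    # With padding=True and return_tensors="tf" the result is already a dense
    # tf.Tensor, so no .to_tensor() call (a RaggedTensor method) is needed.
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
    print(type(encoded["input_ids"]), encoded["input_ids"].shape)
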
2 votes · 0 answers

How long does load_dataset take in huggingface?

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code: import datasets from t5_tokenizer_model import SentencePieceUnigramTokenizer vocab_size = 32_000 input_sentence_size = None # Initialize a…
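
When the corpus is only needed to stream text into tokenizer training, streaming mode avoids downloading and preparing the whole dataset up front; a minimal sketch (the dataset name, config, and limits are placeholders for whatever the pre-training script actually uses):

    from datasets import load_dataset

    # streaming=True yields examples lazily instead of materialising the dataset.
    dataset = load_dataset("oscar", "unshuffled_deduplicated_en",
                           split="train", streaming=True)

    def batch_iterator(batch_size=1000, limit=100_000):
        """Yield batches of raw text for tokenizer training, capped at `limit` rows."""
        batch = []
        for i, example in enumerate(dataset):
            if i >= limit:
                break
            batch.append(example["text"])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
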
2 votes · 1 answer

Which loss function to use for a sparse multi-label text classification problem with class skew/imbalance

I am training a sparse multi-label text classifier using Hugging Face models as part of a smart-reply system. The task I am doing is as follows: I take customer utterances as input to the model and classify them to…
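
For sparse multi-label targets the usual choice is a per-label sigmoid with binary cross-entropy, and pos_weight can counter class imbalance; a minimal sketch in plain PyTorch (label count and weights are illustrative):

    import torch
    import torch.nn as nn

    num_labels = 10
    logits = torch.randn(4, num_labels)   # raw model outputs for a batch of 4
    targets = torch.zeros(4, num_labels)  # sparse multi-hot label matrix
    targets[0, 2] = 1.0
    targets[1, 7] = 1.0

    # pos_weight > 1 up-weights rare positive labels to counter class imbalance.
    pos_weight = torch.full((num_labels,), 5.0)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    print(criterion(logits, targets).item())
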