Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
1 vote · 0 answers

Unsupported type () to a Tensor error when using tf.data.Dataset.from_tensor_slices

I am new to machine learning. I am implementing DialoGPT and trying to fine-tune it, but while fine-tuning I am facing an issue creating a dataset with tf.data.Dataset.from_tensor_slices. I am using the code below: tokenizerDialoGPT =…
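A common cause of this from_tensor_slices error is that the tokenized sequences have unequal lengths (a ragged structure), which cannot be converted to a dense tensor. A minimal stdlib sketch of the usual fix, padding every sequence to the same length first (`pad_sequences` and `pad_id` are illustrative stand-ins; with transformers one would typically use `tokenizer(..., padding=True)` instead):

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad each list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hand-written token ids standing in for real tokenizer output
batch = [[101, 7592, 102], [101, 2088, 999, 102]]
padded = pad_sequences(batch)
print(padded)  # [[101, 7592, 102, 0], [101, 2088, 999, 102]]
```

Once all rows have equal length, tf.data.Dataset.from_tensor_slices can build the dataset without complaint.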
1 vote · 0 answers

RoBERTa tokenizer issue for certain characters

I am using RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with some characters. from transformers import BatchEncoding, RobertaTokenizerFast from tokenizers import Encoding tokenizer =…
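For alignment work like this, fast tokenizers can return character offsets via `return_offsets_mapping=True`, and annotations can then be matched to tokens by span overlap. A small sketch, with hand-written offsets standing in for the tokenizer's real output (`tokens_covering` is an illustrative helper, not a library function):

```python
def tokens_covering(offsets, start, end):
    """Return indices of tokens overlapping the [start, end) character span."""
    return [i for i, (s, e) in enumerate(offsets)
            if s < end and e > start and s != e]  # skip zero-width specials

# Stand-in offsets, e.g. for "<s> This is great </s>"
offsets = [(0, 0), (0, 4), (5, 7), (8, 13), (0, 0)]
print(tokens_covering(offsets, 8, 13))  # [3]
```

Overlap-based matching is more robust than exact boundary matching, since byte-level BPE tokenizers like RoBERTa's can split inside annotated spans.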
1 vote · 0 answers

Customization of Wav2Vec2CTCTokenizer with rules

My goal is to fine-tune an ASR model, WavLM, that relies on the pretrained tokenizer Wav2Vec2CTCTokenizer. I want to fine-tune this ASR model on another language and to perform tokenization according to phonological rules, such as syllable…
1 vote · 2 answers

How can we pass a list of strings to a fine-tuned BERT model?

I want to pass a list of strings instead of a single string input to my fine-tuned BERT question-classification model. This is my code, which accepts a single string input: questionclassification_model =…
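Hugging Face tokenizers accept a list of strings directly, e.g. `tokenizer(texts, padding=True, truncation=True, return_tensors="tf")`, which yields a padded batch the model can consume in one call. If only a single-string prediction function is available, a thin batch wrapper also works; a stdlib sketch (`predict_one` below is a stand-in for the real model call):

```python
def predict_batch(predict_one, texts):
    """Apply a single-input prediction function to a list of strings."""
    return [predict_one(t) for t in texts]

# Stand-in "model": returns the word count instead of a class label
labels = predict_batch(lambda t: len(t.split()), ["what is BERT", "hello"])
print(labels)  # [3, 1]
```

Batched tokenization is usually preferable to the wrapper, since it lets the model process all inputs in a single forward pass.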
1 vote · 1 answer

How to use Huggingface Transformers with PrimeQA model?

Here is the model: https://huggingface.co/PrimeQA/t5-base-table-question-generator Hugging Face says that I should use the following code to use the model in transformers: from transformers import AutoTokenizer, AutoModelForSeq2SeqLM tokenizer =…
1 vote · 1 answer

How to use a dataset with a custom function?

I want to call the DatasetDict map function with parameters, and I don't know how to do it. I have a function with the following API: def tokenize_function(tokenizer, examples): s1 = examples["premise"] s2 = examples["hypothesis"] args = (s1,…
asked by user3668129
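datasets' `Dataset.map` accepts extra arguments via its `fn_kwargs` parameter (`ds.map(tokenize_function, fn_kwargs={"tokenizer": tokenizer})`), or the function can be pre-bound with `functools.partial`. A stdlib sketch of the partial approach, with a dummy tokenizer standing in for the real one:

```python
from functools import partial

def tokenize_function(tokenizer, examples):
    return {"input_ids": [tokenizer(s) for s in examples["premise"]]}

dummy_tokenizer = lambda s: s.split()  # stand-in for a real tokenizer
fn = partial(tokenize_function, dummy_tokenizer)

# map would call fn(batch); here we call it directly on a toy batch
print(fn({"premise": ["a b", "c"]}))  # {'input_ids': [['a', 'b'], ['c']]}
```

With partial, the `examples` batch that map passes positionally lands in the second slot, matching the `(tokenizer, examples)` signature from the question.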
1 vote · 0 answers

Slow and fast tokenizers give different outputs (SentencePiece tokenization)

When I use T5TokenizerFast (the fast tokenizer of the T5 architecture), the output is as expected: ['▁', '', '▁Hello', '▁', '', ''] But when I use the normal (slow) tokenizer, it starts to split the special token "</s>" as follows: ['▁',…
asked by canP
1 vote · 1 answer

Equivalent to tokenizer() in Transformers 2.5.0?

I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0. # Converting pretrained BERT classification model to regression model # i.e. extracting base model and swapping out…
1 vote · 1 answer

Issue when importing BloomTokenizer from transformers in Python

I am trying to import BloomTokenizer from transformers: from transformers import BloomTokenizer and I receive the following error: Traceback (most recent call last): File "", line 1, in ImportError: cannot import name…
1 vote · 1 answer

How to preserve the original columns of a dataset when using Huggingface tokenizer?

When using Huggingface Tokenizer with return_overflowing_tokens=True, the results can have multiple token sequence per input string. Therefore, when doing a Dataset.map from strings to token sequence, you need to remove the original columns (as…
asked by SRobertJames
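With `return_overflowing_tokens=True`, a fast tokenizer also returns an `overflow_to_sample_mapping` list saying which input row produced each output chunk, so the dropped columns can be re-attached by indexing with it. A stdlib sketch (`reattach` and the values are illustrative stand-ins):

```python
def reattach(columns, sample_mapping):
    """Replicate per-row column values for every overflow chunk."""
    return {name: [vals[i] for i in sample_mapping]
            for name, vals in columns.items()}

columns = {"label": [0, 1], "doc_id": ["a", "b"]}
sample_mapping = [0, 0, 1]  # row 0 split into two chunks, row 1 into one
print(reattach(columns, sample_mapping))
# {'label': [0, 0, 1], 'doc_id': ['a', 'a', 'b']}
```

Inside a `Dataset.map` function one would apply the same indexing to each original column before returning, so every overflow chunk keeps its source row's metadata.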
1 vote · 1 answer

Huggingface Longformer case-sensitive tokenizer

This page shows how to build a Longformer-based classifier. import pandas as pd import datasets from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig import torch.nn…
1 vote · 1 answer

Huggingface Transformers BERT Tokenizer - Find out which documents get truncated

I am using the Transformers library from Huggingface to create a text classification model based on BERT. For this I tokenize my documents, and I set truncation to true, as my documents are longer than allowed (512 tokens). How can I find out how many…
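One way to count truncated documents is to tokenize once without truncation and count sequences longer than the limit. A stdlib sketch (`toy_tokenize` is a whitespace stand-in; with transformers it would be something like `tokenizer(docs, truncation=False)["input_ids"]`):

```python
def count_truncated(docs, tokenize, max_len=512):
    """Count documents whose token sequence exceeds max_len."""
    return sum(1 for d in docs if len(tokenize(d)) > max_len)

toy_tokenize = lambda text: text.split()  # stand-in for a real tokenizer
docs = ["short doc", "w " * 600]          # second doc has 600 "tokens"
print(count_truncated(docs, toy_tokenize))  # 1
```

Running the untruncated pass once up front also makes it easy to log which specific documents exceed the limit, not just how many.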
1 vote · 1 answer

How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?

Say, I have three sample sentences: s0 = "This model was pretrained using a specific normalization pipeline available here!" s1 = "Thank to all the people around," s2 = "Bengali Mask Language Model for Bengali Language" I could make a batch…
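With the default `longest_first` truncation strategy, BERT-style pair truncation removes one token at a time from whichever sequence of the pair is currently longer, until the pair fits the budget. A plain-Python sketch of that rule (`truncate_pair` is illustrative, not a library function):

```python
def truncate_pair(a, b, max_len):
    """Emulate 'longest_first': trim the longer sequence until the pair fits."""
    a, b = list(a), list(b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

a, b = truncate_pair(list("abcdefgh"), list("xyz"), 8)
print(len(a), len(b))  # 5 3
```

Other strategies (`only_first`, `only_second`) instead trim a single, fixed member of the pair; in a real call the budget also has to leave room for the special tokens.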
1 vote · 2 answers

Huggingface pretrained model's tokenizer and model objects have different maximum input length

I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from Huggingface. My task requires using it on pretty large texts, so it's essential to know the maximum input length. The following code is supposed to load the pretrained model…
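A frequent cause of such mismatches is that the tokenizer reports a placeholder `model_max_length` (transformers uses `int(1e30)` when the limit was never configured), while the real limit lives in the model config's `max_position_embeddings`. Taking the minimum is a common workaround; a sketch with stand-in numbers (514 is only an example value):

```python
PLACEHOLDER = int(1e30)  # what an unconfigured tokenizer may report

def effective_max_length(tokenizer_max, max_position_embeddings):
    """Prefer the smaller, real positional limit over an unset placeholder."""
    return min(tokenizer_max, max_position_embeddings)

print(effective_max_length(PLACEHOLDER, 514))  # 514
```

In real code the two inputs would come from `tokenizer.model_max_length` and `model.config.max_position_embeddings`.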
1 vote · 3 answers

Huggingface SageMaker

I am trying to use the text2text (translation) model facebook/m2m100_418M on SageMaker. If you click on Deploy and then SageMaker, there is some boilerplate code that works well, but I can't seem to find how to pass it the arguments…
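The Hugging Face SageMaker inference toolkit generally accepts extra pipeline arguments in a "parameters" field alongside "inputs" in the request payload. A sketch of building such a payload (whether these particular keys are honored depends on the pipeline and toolkit version, so treat them as assumptions):

```python
payload = {
    "inputs": "Hello world",
    "parameters": {"src_lang": "en", "tgt_lang": "fr"},  # assumed m2m100 args
}
# predictor.predict(payload)  # assuming a deployed HuggingFaceModel predictor
print(payload["parameters"]["tgt_lang"])  # fr
```

The "parameters" dict is forwarded to the underlying pipeline call, which is where translation-specific options like the target language would normally go.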