Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
Questions tagged [huggingface-tokenizers]
451 questions
1
vote
0 answers
unsupported type () to a Tensor error when using tf.data.Dataset.from_tensor_slices
I am new to machine learning. I am implementing DialoGPT and trying to fine-tune it, but while fine-tuning I am facing an issue while creating a dataset using tf.data.Dataset.from_tensor_slices.
I am using the below code:
tokenizerDialoGPT =…

Kshitij Sinha
- 11
- 2
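The usual cause of an "unsupported type to a Tensor" error at this point is that the tokenized sequences have unequal lengths, so `tf.data.Dataset.from_tensor_slices` cannot stack them into one rectangular tensor. A minimal sketch of the fix, using a hypothetical plain-Python padding helper (with a real tokenizer you would pass `padding="max_length"` or `padding=True` instead):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length token-id lists to a common length
    so they can be stacked into one rectangular tensor."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

# Toy token-id lists of unequal length (the ids are made up).
input_ids = [[101, 7592, 102], [101, 2088, 999, 102]]
padded = pad_batch(input_ids)
# Every row now has length 4, so a call like
#   tf.data.Dataset.from_tensor_slices({"input_ids": padded})
# can build the dataset without the "unsupported type" error.
```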
1
vote
0 answers
RoBERTa tokenizer issue for certain characters
I am using the RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with some characters:
from transformers import BatchEncoding, RobertaTokenizerFast
from tokenizers import Encoding
tokenizer =…

Paschalis
- 191
- 10
1
vote
0 answers
Customization of Wav2Vec2CTCTokenizer with rules
My goal is to fine-tune an ASR model, WavLM, that relies on the pretrained tokenizer Wav2Vec2CTCTokenizer.
I want to fine-tune this ASR model with another language and to perform the tokenization according to phonological rules, such as syllable…

Sara Picciau
- 11
- 3
1
vote
2 answers
How can we pass a list of strings to a fine-tuned BERT model?
I want to pass a list of strings instead of a single string input to my fine-tuned BERT question classification model.
This is my code, which accepts a single string input.
questionclassification_model =…

Abin Jilson
- 35
- 6
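Hugging Face tokenizers already accept either a single string or a list of strings and return batched encodings in the latter case. The sketch below imitates that dispatch with a toy tokenizer (the toy function and its "ids" are hypothetical stand-ins; with a real model you would call `tokenizer(questions, padding=True, truncation=True, return_tensors="pt")`):

```python
def toy_tokenize(text_or_batch):
    """Mimic the str-vs-list dispatch of a Hugging Face tokenizer:
    one string yields one id list, a list of strings yields a batch."""
    encode = lambda s: [len(w) for w in s.split()]  # stand-in for real token ids
    if isinstance(text_or_batch, str):
        return {"input_ids": encode(text_or_batch)}
    return {"input_ids": [encode(s) for s in text_or_batch]}

single = toy_tokenize("what is BERT")
batch = toy_tokenize(["what is BERT", "who made it"])
# `batch["input_ids"]` is a list of per-question id lists, which is the
# shape a classification model expects for batched inference.
```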
1
vote
1 answer
How to use Huggingface Transformers with PrimeQA model?
Here is the model https://huggingface.co/PrimeQA/t5-base-table-question-generator
Hugging Face says that I should use the following code to use the model in transformers:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer =…

Real Noob
- 1,369
- 2
- 15
- 29
1
vote
1 answer
How to use a dataset with a custom function?
I want to call the DatasetDict map function with parameters, and I don't know how to do it.
I have function with the following API:
def tokenize_function(tokenizer, examples):
s1 = examples["premise"]
s2 = examples["hypothesis"]
args = (s1,…

user3668129
- 4,318
- 6
- 45
- 87
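There are two standard ways to bind extra arguments before handing a function to `Dataset.map`: `functools.partial`, or `map`'s `fn_kwargs` parameter. A minimal sketch with a toy tokenizer (the toy function and the batch dict are hypothetical stand-ins for the real objects):

```python
from functools import partial

def tokenize_function(tokenizer, examples):
    # Same argument order as the question's API: tokenizer first, batch second.
    s1, s2 = examples["premise"], examples["hypothesis"]
    return {"n_tokens": [tokenizer(a) + tokenizer(b) for a, b in zip(s1, s2)]}

toy_tokenizer = lambda s: len(s.split())           # stand-in for a real tokenizer
batch = {"premise": ["a b c", "d"], "hypothesis": ["e f", "g h i"]}

bound = partial(tokenize_function, toy_tokenizer)  # binds the first argument
result = bound(batch)
# With a real DatasetDict:
#   dataset.map(partial(tokenize_function, tokenizer), batched=True)
# fn_kwargs also works, but only if `examples` is the FIRST parameter,
# because map passes the batch positionally:
#   dataset.map(tokenize_function, batched=True,
#               fn_kwargs={"tokenizer": tokenizer})
```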
1
vote
0 answers
Slow and fast tokenizers give different outputs (SentencePiece tokenization)
When I use T5TokenizerFast (the fast tokenizer of the T5 architecture), the output is as expected:
['▁', '', '▁Hello', '▁', '', '']
But when I use the slow tokenizer, it starts to split the special token "</s>" as follows:
['▁', 's', '>',…

canP
- 25
- 4
1
vote
1 answer
Equivalent to tokenizer() in Transformers 2.5.0?
I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0.
# Converting pretrained BERT classification model to regression model
# i.e. extracting base model and swapping out…

galactic_tok
- 13
- 5
1
vote
1 answer
Issue when importing BloomTokenizer from transformers in Python
I am trying to import BloomTokenizer from transformers:
from transformers import BloomTokenizer
and I receive the following error
Traceback (most recent call last):
File "", line 1, in
ImportError: cannot import name…

Marcel
- 167
- 2
- 11
1
vote
1 answer
How to preserve the original columns of a dataset when using Huggingface tokenizer?
When using a Huggingface Tokenizer with return_overflowing_tokens=True, the results can have multiple token sequences per input string. Therefore, when doing a Dataset.map from strings to token sequences, you need to remove the original columns (as…

SRobertJames
- 8,210
- 14
- 60
- 107
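Fast tokenizers also return an `overflow_to_sample_mapping` that records which original row each overflowed chunk came from; repeating the original column values along that mapping lets `map` keep them instead of dropping them. A pure-Python sketch (the column names are hypothetical):

```python
def repeat_columns(columns, sample_mapping):
    """Repeat each original column value once per overflowed chunk,
    using the overflow_to_sample_mapping a fast tokenizer returns."""
    return {name: [values[i] for i in sample_mapping]
            for name, values in columns.items()}

# Two input rows; the first spilled into two chunks, the second into one.
sample_mapping = [0, 0, 1]
original = {"label": ["pos", "neg"], "doc_id": [10, 20]}
expanded = repeat_columns(original, sample_mapping)
# In a real map function you would merge `expanded` into the tokenizer
# output so the original columns survive without being removed.
```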
1
vote
1 answer
Huggingface Longformer case-sensitive tokenizer
This page shows how to build a Longformer-based classifier.
import pandas as pd
import datasets
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
import torch.nn…

user2543622
- 5,760
- 25
- 91
- 159
1
vote
1 answer
Huggingface Transformers BERT tokenizer - find out which documents get truncated
I am using the Transformers library from Huggingface to create a text classification model based on BERT. For this I tokenize my documents and set truncation to true, as my documents are longer than the allowed maximum (512 tokens).
How can I find out how many…

Ethan Van den Bleeken
- 368
- 1
- 10
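One way to find the truncated documents is to encode each one once without truncation and compare its length to the model limit. A sketch with a toy token counter (the counter is a hypothetical stand-in for `len(tokenizer(doc)["input_ids"])`):

```python
def truncated_indices(docs, count_tokens, max_len=512):
    """Return the indices of documents whose untruncated token count
    exceeds the model's limit (512 for BERT)."""
    return [i for i, d in enumerate(docs) if count_tokens(d) > max_len]

count = lambda s: len(s.split())  # stand-in for a real tokenizer's id count
docs = ["short doc", "w " * 600]  # second doc has 600 toy tokens
over = truncated_indices(docs, count, max_len=512)
# `over` lists the positions of documents the tokenizer would truncate,
# and len(over) answers "how many documents get truncated".
```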
1
vote
1 answer
How does truncation work when applying the BERT tokenizer to a batch of sentence pairs in HuggingFace?
Say, I have three sample sentences:
s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "Bengali Mask Language Model for Bengali Language"
I could make a batch…

Abu Ubaida
- 95
- 6
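For sentence pairs the default strategy is "longest_first": tokens are removed one at a time from whichever sequence is currently longer until the pair fits the limit. A pure-Python sketch of that rule, under the assumption that ties trim the first sequence (the real call is simply `tokenizer(s1_list, s2_list, truncation=True)`):

```python
def longest_first(a, b, max_len):
    """Mimic the 'longest_first' pair-truncation rule: repeatedly drop
    the last token of the longer sequence until len(a)+len(b) <= max_len."""
    a, b = list(a), list(b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

# A 5-token and a 2-token sequence truncated to a combined length of 5:
a, b = longest_first([1, 2, 3, 4, 5], [6, 7], 5)
# Only the longer sequence loses tokens; the shorter one is untouched.
```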
1
vote
2 answers
Huggingface pretrained model's tokenizer and model objects have different maximum input length
I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from huggingface. My task requires using it on pretty large texts, so it's essential to know the maximum input length.
The following code is supposed to load pretrained model…

Nick Zorander
- 131
- 12
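The two limits live in different objects: `tokenizer.model_max_length` (the tokenizer-side limit, which is sometimes a huge "no limit" sentinel) and `model.config.max_position_embeddings` (the architectural limit). A sketch of reconciling them (the helper function is hypothetical; the numbers below are illustrative, not claims about this specific checkpoint):

```python
VERY_LARGE_SENTINEL = int(1e30)  # some tokenizers report a value this size
                                 # to mean "no tokenizer-side limit"

def effective_max_length(tokenizer_max, max_position_embeddings):
    """Pick the usable input length: ignore an absurd tokenizer-side
    sentinel and defer to the model's positional-embedding limit."""
    if tokenizer_max >= VERY_LARGE_SENTINEL:
        return max_position_embeddings
    return min(tokenizer_max, max_position_embeddings)

limit = effective_max_length(VERY_LARGE_SENTINEL, 514)
# In practice you would inspect tokenizer.model_max_length and
# model.config.max_position_embeddings and take the smaller sane value.
```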
1
vote
3 answers
Huggingface SageMaker
I am trying to use the text2text (translation) model facebook/m2m100_418M to run on SageMaker.
If you click on Deploy and then SageMaker, there is some boilerplate code that works well, but I can't seem to find how to pass it the arguments…

Felix Verhulst
- 43
- 7