Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
1 vote · 0 answers

Unsupported type () to a Tensor error when using tf.data.Dataset.from_tensor_slices

I am new to machine learning. I am implementing DialoGPT and trying to fine-tune it, but while fine-tuning I am facing an issue creating a dataset with tf.data.Dataset.from_tensor_slices. I am using the code below: tokenizerDialoGPT =…
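A common cause of this from_tensor_slices error is that the tokenized sequences have unequal lengths (a ragged structure), which cannot be converted to a dense tensor. A minimal stdlib sketch of the usual fix, padding every sequence to the same length first (`pad_sequences` and `pad_id` are illustrative stand-ins; with transformers one would typically use `tokenizer(..., padding=True)` instead):

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad each list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hand-written token ids standing in for real tokenizer output
batch = [[101, 7592, 102], [101, 2088, 999, 102]]
padded = pad_sequences(batch)
print(padded)  # [[101, 7592, 102, 0], [101, 2088, 999, 102]]
```

Once all rows have equal length, tf.data.Dataset.from_tensor_slices can build the dataset without complaint.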
1 vote · 0 answers

RoBERTa tokenizer issue for certain characters

I am using RobertaTokenizerFast to tokenize some sentences and align them with annotations. I noticed an issue with some characters. from transformers import BatchEncoding, RobertaTokenizerFast from tokenizers import Encoding tokenizer =…
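For alignment work like this, fast tokenizers can return character offsets via `return_offsets_mapping=True`, and annotations can then be matched to tokens by span overlap. A small sketch, with hand-written offsets standing in for the tokenizer's real output (`tokens_covering` is an illustrative helper, not a library function):

```python
def tokens_covering(offsets, start, end):
    """Return indices of tokens overlapping the [start, end) character span."""
    return [i for i, (s, e) in enumerate(offsets)
            if s < end and e > start and s != e]  # skip zero-width specials

# Stand-in offsets, e.g. for "<s> This is great </s>"
offsets = [(0, 0), (0, 4), (5, 7), (8, 13), (0, 0)]
print(tokens_covering(offsets, 8, 13))  # [3]
```

Overlap-based matching is more robust than exact boundary matching, since byte-level BPE tokenizers like RoBERTa's can split inside annotated spans.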
1 vote · 0 answers

Customization of Wav2Vec2CTCTokenizer with rules

My goal is to fine-tune an ASR model, WavLM, that relies on the pretrained tokenizer Wav2Vec2CTCTokenizer. I want to fine-tune this ASR model on another language and to perform tokenization according to phonological rules, such as syllable…
1 vote · 2 answers

How can we pass a list of strings to a fine-tuned BERT model?

I want to pass a list of strings instead of a single string input to my fine-tuned BERT question-classification model. This is my code, which accepts a single string input: questionclassification_model =…
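Hugging Face tokenizers accept a list of strings directly, e.g. `tokenizer(texts, padding=True, truncation=True, return_tensors="tf")`, which yields a padded batch the model can consume in one call. If only a single-string prediction function is available, a thin batch wrapper also works; a stdlib sketch (`predict_one` below is a stand-in for the real model call):

```python
def predict_batch(predict_one, texts):
    """Apply a single-input prediction function to a list of strings."""
    return [predict_one(t) for t in texts]

# Stand-in "model": returns the word count instead of a class label
labels = predict_batch(lambda t: len(t.split()), ["what is BERT", "hello"])
print(labels)  # [3, 1]
```

Batched tokenization is usually preferable to the wrapper, since it lets the model process all inputs in a single forward pass.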
1 vote · 1 answer

How to use Huggingface Transformers with PrimeQA model?

Here is the model: https://huggingface.co/PrimeQA/t5-base-table-question-generator Hugging Face says that I should use the following code to use the model in transformers: from transformers import AutoTokenizer, AutoModelForSeq2SeqLM tokenizer =…
1 vote · 1 answer

How to use a dataset with a custom function?

I want to call the DatasetDict map function with parameters, and I don't know how to do it. I have a function with the following API: def tokenize_function(tokenizer, examples): s1 = examples["premise"] s2 = examples["hypothesis"] args = (s1,…
asked by user3668129
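datasets' `Dataset.map` accepts extra arguments via its `fn_kwargs` parameter (`ds.map(tokenize_function, fn_kwargs={"tokenizer": tokenizer})`), or the function can be pre-bound with `functools.partial`. A stdlib sketch of the partial approach, with a dummy tokenizer standing in for the real one:

```python
from functools import partial

def tokenize_function(tokenizer, examples):
    return {"input_ids": [tokenizer(s) for s in examples["premise"]]}

dummy_tokenizer = lambda s: s.split()  # stand-in for a real tokenizer
fn = partial(tokenize_function, dummy_tokenizer)

# map would call fn(batch); here we call it directly on a toy batch
print(fn({"premise": ["a b", "c"]}))  # {'input_ids': [['a', 'b'], ['c']]}
```

With partial, the `examples` batch that map passes positionally lands in the second slot, matching the `(tokenizer, examples)` signature from the question.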
1 vote · 0 answers

Slow and fast tokenizers give different outputs (SentencePiece tokenization)

When I use T5TokenizerFast (the fast tokenizer of the T5 architecture), the output is as expected: ['▁', '', '▁Hello', '▁', '', ''] But when I use the normal (slow) tokenizer, it starts to split the special token "</s>" as follows: ['▁',…
asked by canP
1 vote · 1 answer

Equivalent to tokenizer() in Transformers 2.5.0?

I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0. # Converting pretrained BERT classification model to regression model # i.e. extracting base model and swapping out…
1 vote · 1 answer

Issue when importing BloomTokenizer from transformers in Python

I am trying to import BloomTokenizer from transformers: from transformers import BloomTokenizer and I receive the following error: Traceback (most recent call last): File "", line 1, in ImportError: cannot import name…
1 vote · 1 answer

How to preserve the original columns of a dataset when using Huggingface tokenizer?

When using Huggingface Tokenizer with return_overflowing_tokens=True, the results can have multiple token sequence per input string. Therefore, when doing a Dataset.map from strings to token sequence, you need to remove the original columns (as…
asked by SRobertJames
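With `return_overflowing_tokens=True`, a fast tokenizer also returns an `overflow_to_sample_mapping` list saying which input row produced each output chunk, so the dropped columns can be re-attached by indexing with it. A stdlib sketch (`reattach` and the values are illustrative stand-ins):

```python
def reattach(columns, sample_mapping):
    """Replicate per-row column values for every overflow chunk."""
    return {name: [vals[i] for i in sample_mapping]
            for name, vals in columns.items()}

columns = {"label": [0, 1], "doc_id": ["a", "b"]}
sample_mapping = [0, 0, 1]  # row 0 split into two chunks, row 1 into one
print(reattach(columns, sample_mapping))
# {'label': [0, 0, 1], 'doc_id': ['a', 'a', 'b']}
```

Inside a `Dataset.map` function one would apply the same indexing to each original column before returning, so every overflow chunk keeps its source row's metadata.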
1 vote · 1 answer

Huggingface Longformer case-sensitive tokenizer

This page shows how to build a Longformer-based classifier. import pandas as pd import datasets from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig import torch.nn…
1 vote · 1 answer

Huggingface Transformers BERT Tokenizer - Find out which documents get truncated

I am using the Transformers library from Huggingface to create a text classification model based on BERT. For this I tokenize my documents, and I set truncation to true, as my documents are longer than allowed (512 tokens). How can I find out how many…
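One way to count truncated documents is to tokenize once without truncation and count sequences longer than the limit. A stdlib sketch (`toy_tokenize` is a whitespace stand-in; with transformers it would be something like `tokenizer(docs, truncation=False)["input_ids"]`):

```python
def count_truncated(docs, tokenize, max_len=512):
    """Count documents whose token sequence exceeds max_len."""
    return sum(1 for d in docs if len(tokenize(d)) > max_len)

toy_tokenize = lambda text: text.split()  # stand-in for a real tokenizer
docs = ["short doc", "w " * 600]          # second doc has 600 "tokens"
print(count_truncated(docs, toy_tokenize))  # 1
```

Running the untruncated pass once up front also makes it easy to log which specific documents exceed the limit, not just how many.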
1 vote · 1 answer

How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?

Say, I have three sample sentences: s0 = "This model was pretrained using a specific normalization pipeline available here!" s1 = "Thank to all the people around," s2 = "Bengali Mask Language Model for Bengali Language" I could make a batch…
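With the default `longest_first` truncation strategy, BERT-style pair truncation removes one token at a time from whichever sequence of the pair is currently longer, until the pair fits the budget. A plain-Python sketch of that rule (`truncate_pair` is illustrative, not a library function):

```python
def truncate_pair(a, b, max_len):
    """Emulate 'longest_first': trim the longer sequence until the pair fits."""
    a, b = list(a), list(b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

a, b = truncate_pair(list("abcdefgh"), list("xyz"), 8)
print(len(a), len(b))  # 5 3
```

Other strategies (`only_first`, `only_second`) instead trim a single, fixed member of the pair; in a real call the budget also has to leave room for the special tokens.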
1 vote · 2 answers

Huggingface pretrained model's tokenizer and model objects have different maximum input length

I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from Huggingface. My task requires using it on pretty large texts, so it's essential to know the maximum input length. The following code is supposed to load the pretrained model…
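A frequent cause of such mismatches is that the tokenizer reports a placeholder `model_max_length` (transformers uses `int(1e30)` when the limit was never configured), while the real limit lives in the model config's `max_position_embeddings`. Taking the minimum is a common workaround; a sketch with stand-in numbers (514 is only an example value):

```python
PLACEHOLDER = int(1e30)  # what an unconfigured tokenizer may report

def effective_max_length(tokenizer_max, max_position_embeddings):
    """Prefer the smaller, real positional limit over an unset placeholder."""
    return min(tokenizer_max, max_position_embeddings)

print(effective_max_length(PLACEHOLDER, 514))  # 514
```

In real code the two inputs would come from `tokenizer.model_max_length` and `model.config.max_position_embeddings`.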
1 vote · 3 answers

Huggingface SageMaker

I am trying to use the text2text (translation) model facebook/m2m100_418M on SageMaker. If you click on Deploy and then SageMaker, there is some boilerplate code that works well, but I can't seem to find how to pass it the arguments…
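The Hugging Face SageMaker inference toolkit generally accepts extra pipeline arguments in a "parameters" field alongside "inputs" in the request payload. A sketch of building such a payload (whether these particular keys are honored depends on the pipeline and toolkit version, so treat them as assumptions):

```python
payload = {
    "inputs": "Hello world",
    "parameters": {"src_lang": "en", "tgt_lang": "fr"},  # assumed m2m100 args
}
# predictor.predict(payload)  # assuming a deployed HuggingFaceModel predictor
print(payload["parameters"]["tgt_lang"])  # fr
```

The "parameters" dict is forwarded to the underlying pipeline call, which is where translation-specific options like the target language would normally go.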