Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
3 votes · 2 answers

Tokenizers change vocabulary entry

I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so: import transformers as ts pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp') Then I create my own…
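
A plausible sketch of what is involved here, assuming the goal is to inspect or extend the vocabulary (the token strings below are placeholders, not from the question): the public API exposes get_vocab() for lookups and add_tokens() for extensions, rather than in-place replacement of an entry.

    import transformers as ts

    pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

    vocab = pr_tokenizer.get_vocab()         # dict mapping token string -> id
    print(vocab['hello'])                    # look up the id of an existing entry

    # Replacing an entry in place is not part of the public API; the usual
    # route is to add tokens (and resize the model's embeddings afterwards).
    pr_tokenizer.add_tokens(['mynewtoken'])
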
3 votes · 1 answer

How to train a tokenizer on a big dataset?

Based on examples, I am trying to train a tokenizer and a model for T5 for Persian. I am using Google Colab Pro; when I tried to run the following code: import datasets from t5_tokenizer_model import SentencePieceUnigramTokenizer vocab_size =…
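
A hedged sketch of how a tokenizer can be trained without holding the whole corpus in memory, using the tokenizers library's SentencePieceUnigramTokenizer in place of the question's local t5_tokenizer_model module; the dataset name, field name, and vocab_size are illustrative assumptions:

    import datasets
    from tokenizers import SentencePieceUnigramTokenizer

    vocab_size = 32_000
    dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_fa',
                                    split='train', streaming=True)

    def batch_iterator(batch_size=1000):
        # stream text in small batches so the full dataset never sits in RAM
        batch = []
        for example in dataset:
            batch.append(example['text'])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    tokenizer = SentencePieceUnigramTokenizer()
    tokenizer.train_from_iterator(batch_iterator(), vocab_size=vocab_size)
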
3 votes · 1 answer

Using a Hugging Face transformer with arguments in pipeline

I am trying to use a transformers pipeline to get BERT embeddings for my input. Without a pipeline I am able to get consistent outputs, but not with the pipeline, since I was not able to pass arguments to it. How can I pass transformer-related…
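
A hedged sketch, assuming a recent transformers version in which feature-extraction pipelines accept tokenize_kwargs for forwarding arguments to the tokenizer; the model name and values are illustrative:

    from transformers import pipeline

    extractor = pipeline(
        'feature-extraction',
        model='bert-base-uncased',
        tokenize_kwargs={'truncation': True, 'max_length': 512},
    )
    embeddings = extractor('some input text')
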
3 votes · 3 answers

HuggingFace TypeError: '>' not supported between instances of 'NoneType' and 'int'

I am fine-tuning a pretrained model on a custom dataset (using HuggingFace). I copied all the code from a YouTube video and everything works until this cell/code: with training_args.strategy.scope(): …
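
Not the original code, just a minimal reproduction of this error class: in Python 3, comparing None with an int raises exactly this TypeError, so a likely culprit is some numeric setting that was left as None.

    value = None
    value > 0   # TypeError: '>' not supported between instances of 'NoneType' and 'int'
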
3 votes · 1 answer

Can't load transformers models

I have the following problem loading a transformer model. The strange thing is that it works on Google Colab and even when I tried it on another computer; it seems to be a version/cache problem, but I haven't found it. from sentence_transformers import…
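
A hedged sketch for ruling out a stale or corrupted cache, assuming the model comes from sentence-transformers; the model name and cache path are illustrative:

    from sentence_transformers import SentenceTransformer

    # point cache_folder at a fresh directory to force a clean download
    model = SentenceTransformer('all-MiniLM-L6-v2', cache_folder='./fresh_cache')
    embeddings = model.encode(['a test sentence'])
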
3 votes · 1 answer

Strange results with huggingface transformer[marianmt] translation of larger text

I need to translate large amounts of text from a database. Therefore, I've been working with transformers and models for a few days. I'm by no means a data science expert and unfortunately I'm not getting any further. The problem starts with longer…
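
MarianMT models have a fixed maximum input length (512 tokens), so the usual workaround is to split long text into sentences and translate in batches. A minimal sketch, assuming a German-to-English model; the sentence splitter is deliberately naive:

    from transformers import MarianMTModel, MarianTokenizer

    model_name = 'Helsinki-NLP/opus-mt-de-en'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(sentences):
        batch = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
        generated = model.generate(**batch)
        return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

    # naive split; a real splitter (e.g. nltk) would be more robust
    text = 'Erster Satz. Zweiter Satz.'
    sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]
    print(translate(sentences))
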
3 votes · 1 answer

Transformers tokenizer returns overlapping tokens. Is that a bug or am I doing something wrong?

I have been trying to do some token classification using Hugging Face transformers. I'm seeing instances where the tokenizer returns overlapping tokens. Sometimes (but not always) this results in the model giving me an entity such that the…
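
One way to see exactly what the tokenizer returns is to ask a fast tokenizer for character offsets, where overlapping (start, end) ranges would be directly visible. A small sketch with an illustrative model and sentence:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    enc = tokenizer('An example sentence', return_offsets_mapping=True)
    for token, (start, end) in zip(enc.tokens(), enc['offset_mapping']):
        print(token, start, end)   # overlapping ranges would show up here
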
3 votes · 0 answers

Model overfits after first epoch

I'm trying to use Hugging Face's bert-base-uncased model to train on emoji prediction on tweets, and it seems that after the first epoch the model immediately starts to overfit. I have tried the following: increasing the training data (I increased…
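
Not a diagnosis of this particular run, just a hedged sketch of the standard levers when a model overfits after one epoch: lower learning rate, weight decay, and early stopping on validation loss (all values illustrative):

    from transformers import TrainingArguments, EarlyStoppingCallback

    args = TrainingArguments(
        output_dir='out',
        learning_rate=2e-5,
        weight_decay=0.01,
        num_train_epochs=5,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
    )
    early_stop = EarlyStoppingCallback(early_stopping_patience=2)
    # both would then be passed to Trainer(..., args=args, callbacks=[early_stop])
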
3 votes · 1 answer

BERT - Is that needed to add new tokens to be trained in a domain specific environment?

My question here is not how to add new tokens or how to train using a domain-specific corpus; I'm already doing that. The thing is: am I supposed to add the domain-specific tokens before the MLM training, or should I just let BERT figure out the context?…
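
For the "add tokens first" route, the usual pattern is: extend the tokenizer, resize the model's embedding matrix, then run MLM training as normal. A minimal sketch; the token list is an illustrative assumption:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

    new_tokens = ['myocarditis', 'angioplasty']      # domain-specific examples
    num_added = tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))    # make room for the new ids
    print(f'added {num_added} tokens')
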
3 votes · 1 answer

Huggingface error: AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'

I am trying to tokenize some numerical strings using a WordLevel/BPE tokenizer, create a data collator and eventually use it in a PyTorch DataLoader to train a new model from scratch. However, I am getting an error AttributeError:…
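
A hedged sketch of one common fix: the raw tokenizers object lacks the pad_token_id attribute, so the usual route is to save it and wrap it in PreTrainedTokenizerFast, which data collators understand (the training corpus here is a throwaway placeholder):

    from tokenizers import ByteLevelBPETokenizer
    from transformers import PreTrainedTokenizerFast

    bpe = ByteLevelBPETokenizer()
    bpe.train_from_iterator(['123 456', '789 1011'], vocab_size=300,
                            special_tokens=['<pad>'])
    bpe.save('tokenizer.json')

    tokenizer = PreTrainedTokenizerFast(tokenizer_file='tokenizer.json',
                                        pad_token='<pad>')
    print(tokenizer.pad_token_id)   # now defined, so data collators can pad
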
3 votes · 1 answer

How to map token indices from the SQuAD data to tokens from BERT tokenizer?

I am using the SQuAD dataset for answer span selection. After using the BertTokenizer to tokenize the passages, for some samples the start and end indices of the answer no longer match the real answer span position in the passage tokens. How to…
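
Fast tokenizers can do this mapping directly: char_to_token() converts a character position in the passage into a token index in the encoding. A small sketch with an illustrative question/context pair:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    context = 'The capital of France is Paris.'
    answer_start_char = context.index('Paris')
    answer_end_char = answer_start_char + len('Paris') - 1

    enc = tokenizer('What is the capital of France?', context)
    # sequence_index=1 selects the context (second sequence in the pair)
    start_token = enc.char_to_token(answer_start_char, sequence_index=1)
    end_token = enc.char_to_token(answer_end_char, sequence_index=1)
    print(start_token, end_token)
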
3 votes · 0 answers

Enabling truncation in transformers feature extraction pipeline

I'm using the transformers FeatureExtractionPipeline like this: from transformers import pipeline, LongformerTokenizer, LongformerModel tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096') model =…
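
A hedged sketch of the common workaround when the pipeline won't forward truncation: tokenize manually with the desired arguments and call the model directly, which yields the same hidden states the pipeline would extract:

    import torch
    from transformers import LongformerTokenizer, LongformerModel

    tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
    model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

    inputs = tokenizer('a very long document ...', truncation=True,
                       max_length=4096, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    features = outputs.last_hidden_state   # per-token embeddings
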
3 votes · 0 answers

Should I create a PyTorch Dataset to train a model off a pyspark dataframe?

I want to train a PyTorch NLP model on training data in columnar format, and I thought of constructing a PyTorch Dataset using a pyspark dataframe as the raw data (not sure it's the right approach...). To preprocess the text I'm using a tokenizer provided by…
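
One workable route, sketched under the assumption that the data fits in driver memory: collect the spark dataframe to pandas and tokenize inside __getitem__; the column names are illustrative:

    import torch
    from torch.utils.data import Dataset

    class SparkTextDataset(Dataset):
        def __init__(self, spark_df, tokenizer, max_length=128):
            # toPandas() collects to the driver; fine for in-memory data
            self.pdf = spark_df.select('text', 'label').toPandas()
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.pdf)

        def __getitem__(self, idx):
            row = self.pdf.iloc[idx]
            enc = self.tokenizer(row['text'], truncation=True, padding='max_length',
                                 max_length=self.max_length, return_tensors='pt')
            item = {k: v.squeeze(0) for k, v in enc.items()}
            item['labels'] = torch.tensor(row['label'])
            return item
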
3 votes · 1 answer

Huggingface MarianMT translators lose content, depending on the model

Context: I am using MarianMT from Hugging Face via Python in order to translate text from a source to a target language. Expected behaviour: I enter a sequence into the MarianMT model and get this sequence translated back. For this, I use a…
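
A hedged sketch for checking whether content is being lost to generation limits: set the output length budget and beam count explicitly rather than relying on defaults (model name and input are illustrative):

    from transformers import MarianMTModel, MarianTokenizer

    model_name = 'Helsinki-NLP/opus-mt-en-de'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(['A long sequence to translate ...'],
                      return_tensors='pt', truncation=True)
    generated = model.generate(**batch, max_length=512, num_beams=4)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))
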
3 votes · 2 answers

BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "ke", "en"). Then the word "token" could be encoded as ("to",…
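
A toy sketch in plain Python that enumerates the possible segmentations of "token" over the example vocabulary, making the ambiguity concrete; an actual BPE tokenizer resolves it deterministically via its learned merge order:

    vocab = {'t', 'o', 'k', 'e', 'n', 'to', 'ke', 'en'}

    def segmentations(word):
        # all ways to cover `word` with tokens from `vocab`
        if not word:
            return [[]]
        results = []
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            if prefix in vocab:
                for rest in segmentations(word[i:]):
                    results.append([prefix] + rest)
        return results

    for seg in segmentations('token'):
        print(seg)   # e.g. ['to', 'ke', 'n'] and ['to', 'k', 'en'] among others
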