Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
3 votes · 2 answers

Tokenizers change vocabulary entry

I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so: import transformers as ts pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp') Then I create my own…
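
A plausible sketch of what is involved here, assuming the goal is to inspect or extend the vocabulary (the token strings below are placeholders, not from the question): the public API exposes get_vocab() for lookups and add_tokens() for extensions, rather than in-place replacement of an entry.

    import transformers as ts

    pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

    vocab = pr_tokenizer.get_vocab()         # dict mapping token string -> id
    print(vocab['hello'])                    # look up the id of an existing entry

    # Replacing an entry in place is not part of the public API; the usual
    # route is to add tokens (and resize the model's embeddings afterwards).
    pr_tokenizer.add_tokens(['mynewtoken'])
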
3 votes · 1 answer

How to train a tokenizer on a big dataset?

Based on examples, I am trying to train a tokenizer and a model for T5 for Persian. I am using Google Colab Pro; when I tried to run the following code: import datasets from t5_tokenizer_model import SentencePieceUnigramTokenizer vocab_size =…
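
A hedged sketch of how a tokenizer can be trained without holding the whole corpus in memory, using the tokenizers library's SentencePieceUnigramTokenizer in place of the question's local t5_tokenizer_model module; the dataset name, field name, and vocab_size are illustrative assumptions:

    import datasets
    from tokenizers import SentencePieceUnigramTokenizer

    vocab_size = 32_000
    dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_fa',
                                    split='train', streaming=True)

    def batch_iterator(batch_size=1000):
        # stream text in small batches so the full dataset never sits in RAM
        batch = []
        for example in dataset:
            batch.append(example['text'])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    tokenizer = SentencePieceUnigramTokenizer()
    tokenizer.train_from_iterator(batch_iterator(), vocab_size=vocab_size)
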
3 votes · 1 answer

Using a Hugging Face transformer with arguments in pipeline

I am trying to use a transformers pipeline to get BERT embeddings for my input. Without a pipeline I am able to get consistent outputs, but not with the pipeline, since I was not able to pass arguments to it. How can I pass transformer-related…
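
A hedged sketch, assuming a recent transformers version in which feature-extraction pipelines accept tokenize_kwargs for forwarding arguments to the tokenizer; the model name and values are illustrative:

    from transformers import pipeline

    extractor = pipeline(
        'feature-extraction',
        model='bert-base-uncased',
        tokenize_kwargs={'truncation': True, 'max_length': 512},
    )
    embeddings = extractor('some input text')
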
3 votes · 3 answers

HuggingFace TypeError: '>' not supported between instances of 'NoneType' and 'int'

I am fine-tuning a pretrained model on a custom dataset (using HuggingFace). I copied all the code from a YouTube video and everything works until this cell/code: with training_args.strategy.scope(): …
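
Not the original code, just a minimal reproduction of this error class: in Python 3, comparing None with an int raises exactly this TypeError, so a likely culprit is some numeric setting that was left as None.

    value = None
    value > 0   # TypeError: '>' not supported between instances of 'NoneType' and 'int'
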
3 votes · 1 answer

Can't load transformers models

I have the following problem loading a transformer model. The strange thing is that it works on Google Colab and even when I tried it on another computer; it seems to be a version/cache problem, but I haven't found it. from sentence_transformers import…
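
A hedged sketch for ruling out a stale or corrupted cache, assuming the model comes from sentence-transformers; the model name and cache path are illustrative:

    from sentence_transformers import SentenceTransformer

    # point cache_folder at a fresh directory to force a clean download
    model = SentenceTransformer('all-MiniLM-L6-v2', cache_folder='./fresh_cache')
    embeddings = model.encode(['a test sentence'])
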
3 votes · 1 answer

Strange results with huggingface transformer[marianmt] translation of larger text

I need to translate large amounts of text from a database. Therefore, I've been working with transformers and models for a few days. I'm by no means a data science expert and unfortunately I'm not getting any further. The problem starts with longer…
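
MarianMT models have a fixed maximum input length (512 tokens), so the usual workaround is to split long text into sentences and translate in batches. A minimal sketch, assuming a German-to-English model; the sentence splitter is deliberately naive:

    from transformers import MarianMTModel, MarianTokenizer

    model_name = 'Helsinki-NLP/opus-mt-de-en'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(sentences):
        batch = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
        generated = model.generate(**batch)
        return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

    # naive split; a real splitter (e.g. nltk) would be more robust
    text = 'Erster Satz. Zweiter Satz.'
    sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]
    print(translate(sentences))
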
3 votes · 1 answer

Transformers tokenizer returns overlapping tokens. Is that a bug or am I doing something wrong?

I have been trying to do some token classification using Hugging Face transformers. I'm seeing instances where the tokenizer returns overlapping tokens. Sometimes (but not always) this results in the model giving me an entity such that the…
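
One way to see exactly what the tokenizer returns is to ask a fast tokenizer for character offsets, where overlapping (start, end) ranges would be directly visible. A small sketch with an illustrative model and sentence:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    enc = tokenizer('An example sentence', return_offsets_mapping=True)
    for token, (start, end) in zip(enc.tokens(), enc['offset_mapping']):
        print(token, start, end)   # overlapping ranges would show up here
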
3 votes · 0 answers

Model overfits after first epoch

I'm trying to use Hugging Face's bert-base-uncased model to train on emoji prediction on tweets, and it seems that after the first epoch the model immediately starts to overfit. I have tried the following: increasing the training data (I increased…
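
Not a diagnosis of this particular run, just a hedged sketch of the standard levers when a model overfits after one epoch: lower learning rate, weight decay, and early stopping on validation loss (all values illustrative):

    from transformers import TrainingArguments, EarlyStoppingCallback

    args = TrainingArguments(
        output_dir='out',
        learning_rate=2e-5,
        weight_decay=0.01,
        num_train_epochs=5,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
    )
    early_stop = EarlyStoppingCallback(early_stopping_patience=2)
    # both would then be passed to Trainer(..., args=args, callbacks=[early_stop])
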
3 votes · 1 answer

BERT - Is that needed to add new tokens to be trained in a domain specific environment?

My question here is not how to add new tokens or how to train using a domain-specific corpus; I'm already doing that. The thing is: am I supposed to add the domain-specific tokens before the MLM training, or should I just let BERT figure out the context?…
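
For the "add tokens first" route, the usual pattern is: extend the tokenizer, resize the model's embedding matrix, then run MLM training as normal. A minimal sketch; the token list is an illustrative assumption:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

    new_tokens = ['myocarditis', 'angioplasty']      # domain-specific examples
    num_added = tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))    # make room for the new ids
    print(f'added {num_added} tokens')
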
3 votes · 1 answer

Huggingface error: AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'

I am trying to tokenize some numerical strings using a WordLevel/BPE tokenizer, create a data collator and eventually use it in a PyTorch DataLoader to train a new model from scratch. However, I am getting an error AttributeError:…
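
A hedged sketch of one common fix: the raw tokenizers object lacks the pad_token_id attribute, so the usual route is to save it and wrap it in PreTrainedTokenizerFast, which data collators understand (the training corpus here is a throwaway placeholder):

    from tokenizers import ByteLevelBPETokenizer
    from transformers import PreTrainedTokenizerFast

    bpe = ByteLevelBPETokenizer()
    bpe.train_from_iterator(['123 456', '789 1011'], vocab_size=300,
                            special_tokens=['<pad>'])
    bpe.save('tokenizer.json')

    tokenizer = PreTrainedTokenizerFast(tokenizer_file='tokenizer.json',
                                        pad_token='<pad>')
    print(tokenizer.pad_token_id)   # now defined, so data collators can pad
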
3 votes · 1 answer

How to map token indices from the SQuAD data to tokens from BERT tokenizer?

I am using the SQuAD dataset for answer span selection. After using the BertTokenizer to tokenize the passages, for some samples the start and end indices of the answer no longer match the real answer span position in the passage tokens. How to…
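
Fast tokenizers can do this mapping directly: char_to_token() converts a character position in the passage into a token index in the encoding. A small sketch with an illustrative question/context pair:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

    context = 'The capital of France is Paris.'
    answer_start_char = context.index('Paris')
    answer_end_char = answer_start_char + len('Paris') - 1

    enc = tokenizer('What is the capital of France?', context)
    # sequence_index=1 selects the context (second sequence in the pair)
    start_token = enc.char_to_token(answer_start_char, sequence_index=1)
    end_token = enc.char_to_token(answer_end_char, sequence_index=1)
    print(start_token, end_token)
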
3 votes · 0 answers

Enabling truncation in transformers feature extraction pipeline

I'm using the transformers FeatureExtractionPipeline like this: from transformers import pipeline, LongformerTokenizer, LongformerModel tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096') model =…
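
A hedged sketch of the common workaround when the pipeline won't forward truncation: tokenize manually with the desired arguments and call the model directly, which yields the same hidden states the pipeline would extract:

    import torch
    from transformers import LongformerTokenizer, LongformerModel

    tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
    model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

    inputs = tokenizer('a very long document ...', truncation=True,
                       max_length=4096, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    features = outputs.last_hidden_state   # per-token embeddings
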
3 votes · 0 answers

Should I create a PyTorch Dataset to train a model off a pyspark dataframe?

I want to train a PyTorch NLP model on training data in columnar format, and I thought of constructing a PyTorch Dataset using a pyspark dataframe as the raw data (not sure it's the right approach...). To preprocess the text I'm using a tokenizer provided by…
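
One workable route, sketched under the assumption that the data fits in driver memory: collect the spark dataframe to pandas and tokenize inside __getitem__; the column names are illustrative:

    import torch
    from torch.utils.data import Dataset

    class SparkTextDataset(Dataset):
        def __init__(self, spark_df, tokenizer, max_length=128):
            # toPandas() collects to the driver; fine for in-memory data
            self.pdf = spark_df.select('text', 'label').toPandas()
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.pdf)

        def __getitem__(self, idx):
            row = self.pdf.iloc[idx]
            enc = self.tokenizer(row['text'], truncation=True, padding='max_length',
                                 max_length=self.max_length, return_tensors='pt')
            item = {k: v.squeeze(0) for k, v in enc.items()}
            item['labels'] = torch.tensor(row['label'])
            return item
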
3 votes · 1 answer

Huggingface MarianMT translators lose content, depending on the model

Context: I am using MarianMT from Hugging Face via Python in order to translate text from a source to a target language. Expected behaviour: I enter a sequence into the MarianMT model and get this sequence translated back. For this, I use a…
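
A hedged sketch for checking whether content is being lost to generation limits: set the output length budget and beam count explicitly rather than relying on defaults (model name and input are illustrative):

    from transformers import MarianMTModel, MarianTokenizer

    model_name = 'Helsinki-NLP/opus-mt-en-de'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(['A long sequence to translate ...'],
                      return_tensors='pt', truncation=True)
    generated = model.generate(**batch, max_length=512, num_beams=4)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))
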
3 votes · 2 answers

BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "ke", "en"). Then the word "token" could be encoded as ("to",…
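
A toy sketch in plain Python that enumerates the possible segmentations of "token" over the example vocabulary, making the ambiguity concrete; an actual BPE tokenizer resolves it deterministically via its learned merge order:

    vocab = {'t', 'o', 'k', 'e', 'n', 'to', 'ke', 'en'}

    def segmentations(word):
        # all ways to cover `word` with tokens from `vocab`
        if not word:
            return [[]]
        results = []
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            if prefix in vocab:
                for rest in segmentations(word[i:]):
                    results.append([prefix] + rest)
        return results

    for seg in segmentations('token'):
        print(seg)   # e.g. ['to', 'ke', 'n'] and ['to', 'k', 'en'] among others
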