Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
2
votes
1 answer

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with the NER model, as in the tutorial from huggingface (it contains only the training and evaluation parts). I am following this exact tutorial here:…
2
votes
0 answers

Manually padding a list of BatchEncodings using huggingface's tokenizer

I am having difficulties understanding the tokenizer.pad method from the huggingface transformers library. In order to optimize training, I am performing tokenization in the Dataset such that no complicated operations are performed during data…
2
votes
0 answers

Train BERT model from scratch on a different language

First I create a tokenizer as follows: from tokenizers import Tokenizer from tokenizers.models import BPE,WordPiece tokenizer = Tokenizer(WordPiece(unk_token="[UNK]")) from tokenizers.trainers import BpeTrainer,WordPieceTrainer trainer =…
2
votes
1 answer

Why does Transformer's BERT (for sequence classification) output depend heavily on maximum sequence length padding?

I am using Transformer's RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset. I wanted to test how well it works on a similar dataset (also on sentiment analysis), so…
2
votes
1 answer

Are these normal speeds for BERT pretrained model inference in PyTorch?

I am testing the BERT base and distilled BERT models in Huggingface in 4 speed scenarios, batch_size = 1: 1) bert-base-uncased: 154ms per request 2) bert-base-uncased with quantization: 94ms per request 3) distilbert-base-uncased: 86ms per…
2
votes
0 answers

huggingface pipeline: bert NER task throws RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to set up a German NER model, pretrained with BERT, via the huggingface pipeline. For some texts the following code throws the error "RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1" for the…
2
votes
0 answers

How to recover the architecture of a PyTorch model having only the weight dictionary?

I wanted to use the multilingual-codesearch model, but the code doesn't work and outputs the following error, which suggests that it cannot load with only the weights: from transformers import AutoTokenizer, AutoModel tokenizer =…
2
votes
0 answers

Huggingface transformer export tokenizer and model

I'm currently working on a text summarizer powered by the Huggingface transformers library. The summarization process has to be done on premises, so I have the following code (close to the documentation): from transformers import BartTokenizer,…
2
votes
2 answers

AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'

I am just using the huggingface transformer library and get the following message when running run_lm_finetuning.py: AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'. Anyone else with this problem or an idea how to fix it?…
2
votes
1 answer

Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with Bert. In order to align the word pieces with labels, I need some marker to identify them so I can get a single embedding for each word by either summing or averaging. For example I want the word New~york…
2
votes
0 answers

PEGASUS pre-training for summarisation tasks

I am unsure of how the evaluation for large document summarisation is conducted for the recently introduced PEGASUS model for single document summarisation. The authors show evaluation against large document datasets like Big Patent, PubMed etc…
2
votes
1 answer

Tokenizing & encoding dataset uses too much RAM

I am trying to tokenize and encode data to feed to a neural network. I only have 25GB RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? “Your session crashed after using all available…
2
votes
1 answer

Applying pre trained facebook/bart-large-cnn for text summarization in python

I am in a situation where I am working with huggingface transformers and have got some insights into it. I am working with the facebook/bart-large-cnn model to perform text summarisation for my project and I am using the following code as of now to…
2
votes
1 answer

Running huggingface Bert tokenizer on GPU

I'm dealing with a huge text dataset for content classification. I've implemented the distilbert model and the distilberttokenizer.from_pretrained() tokenizer. This tokenizer is taking incredibly long to tokenize my text data, roughly 7 mins for just…
2
votes
1 answer

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using bio_clinical bert, for a sentence of 8 words I am getting 11 token ids (+ start and end) because "embeddings" is an out-of-vocabulary word/token that is being split into em, bed, ding, s. I…