Highest Voted 'huggingface-tokenizers' Questions

2 votes, 1 answer

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to run prediction with the NER model from the Hugging Face tutorial (which covers only the training and evaluation part). I am following this exact tutorial here:…

asked Aug 25 '21 at 07:56 by Timbus Calin (reputation 13,809; badges: 5, 41, 59)

2 votes, 0 answers

Manually padding a list of BatchEncodings using huggingface's tokenizer

I am having difficulties understanding the tokenizer.pad method from the huggingface transformers library. In order to optimize training, I am performing tokenization in the Dataset such that no complicated operations are performed during data…

pytorch huggingface-transformers huggingface-tokenizers

asked Jun 22 '21 at 14:38 by Jovan Andonov (reputation 436; badges: 3, 12)
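For readers unsure what padding a batch involves, here is a minimal pure-Python sketch of right-padding to the longest sequence. It is not the library's `tokenizer.pad` implementation; the function name `pad_batch`, the pad id of 0, and the token ids are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(encodings: List[Dict[str, List[int]]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every 'input_ids' list to the longest one in the batch.

    pad_id=0 is an assumption; a real tokenizer exposes its own pad token id.
    """
    max_len = max(len(e["input_ids"]) for e in encodings)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for e in encodings:
        ids = e["input_ids"]
        n_pad = max_len - len(ids)
        # Real tokens get mask 1, padding positions get mask 0.
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Hypothetical token ids, for illustration only.
examples = [{"input_ids": [101, 7592, 102]}, {"input_ids": [101, 7592, 2088, 999, 102]}]
padded = pad_batch(examples)
```

Padding per batch rather than to a global maximum keeps short batches small, which is usually why this step is deferred to collation time.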

2 votes, 0 answers

Train BERT model from scratch on a different language

First I create the tokenizer as follows: from tokenizers import Tokenizer from tokenizers.models import BPE,WordPiece tokenizer = Tokenizer(WordPiece(unk_token="[UNK]")) from tokenizers.trainers import BpeTrainer,WordPieceTrainer trainer =…

python bert-language-model huggingface-transformers huggingface-tokenizers

asked Jun 13 '21 at 10:56 by Talha Anwar (reputation 2,699; badges: 4, 23, 62)

2 votes, 1 answer

Why does Transformer's BERT (for sequence classification) output depend heavily on maximum sequence length padding?

I am using Transformer's RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset. I wanted to test how well it works on a similar dataset (also on sentiment analysis), so…

sentiment-analysis bert-language-model huggingface-transformers huggingface-tokenizers

asked May 31 '21 at 09:33 by Wouter S (reputation 23; badges: 4)

2 votes, 1 answer

Are these normal speeds for BERT pretrained model inference in PyTorch?

I am testing the BERT base and distilled BERT models in Hugging Face in 4 speed scenarios, batch_size = 1: 1) bert-base-uncased: 154ms per request 2) bert-base-uncased with quantization: 94ms per request 3) distilbert-base-uncased: 86ms per…

bert-language-model huggingface-transformers transformer-model huggingface-tokenizers

asked May 26 '21 at 06:07 by marlon (reputation 6,029; badges: 8, 42, 76)

2 votes, 0 answers

huggingface pipeline: bert NER task throws RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to set up a German NER model, pretrained with BERT, via the Hugging Face pipeline. For some texts the following code throws the error "RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1" for the…

python bert-language-model huggingface-transformers named-entity-recognition huggingface-tokenizers

asked May 05 '21 at 22:57 by Michael Göggelmann (reputation 71; badges: 9)
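The error message indicates a 921-token input meeting a 512-position limit. One common workaround, sketched here in plain Python under stated assumptions, is to split the token ids into overlapping windows that each fit the limit. The function name `split_into_windows` and the overlap of 128 are hypothetical choices, and this is not a description of what the pipeline does internally.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 512, stride: int = 128) -> List[List[int]]:
    """Split token ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that an entity cut at
    a window border still appears whole in a neighbouring window.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    if not token_ids:
        return []
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# 921 stand-in ids, matching the length reported in the error message.
windows = split_into_windows(list(range(921)))
```

In practice the budget per window would be a little below 512, since special tokens such as [CLS] and [SEP] also occupy positions; predictions from overlapping regions then need to be merged.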

2 votes, 0 answers

How to recover the architecture of a PyTorch model having only the weight dictionary?

I wanted to use the multilingual-codesearch model, but the code doesn't work and outputs the following error, which suggests that it cannot load with only weights: from transformers import AutoTokenizer, AutoModel tokenizer =…

pytorch huggingface-transformers huggingface-tokenizers state-dict

asked May 01 '21 at 06:28 by AtonKamanda (reputation 21; badges: 3)

2 votes, 0 answers

Huggingface transformer export tokenizer and model

I'm currently working on a text summarizer powered by the Huggingface transformers library. The summarization process has to be done on premises, so I have the following code (close to the documentation): from transformers import BartTokenizer,…

huggingface-transformers huggingface-tokenizers

asked Apr 26 '21 at 15:02 by ovesco (reputation 41; badges: 3)

2 votes, 2 answers

AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'

I am just using the huggingface transformer library and get the following message when running run_lm_finetuning.py: AttributeError: 'GPT2TokenizerFast' object has no attribute 'max_len'. Anyone else with this problem or an idea how to fix it?…

tokenize huggingface-transformers transformer-model huggingface-tokenizers gpt-2

asked Apr 14 '21 at 10:20 by m.b (reputation 45; badges: 1, 4)

2 votes, 1 answer

Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with BERT. In order to align the word pieces with labels, I need some marker to identify them so I can get a single embedding for each word by either summing or averaging. For example I want the word New~york…

huggingface-tokenizers

asked Apr 09 '21 at 18:47 by ashered (reputation 79; badges: 7)

2 votes, 0 answers

PEGASUS pre-training for summarisation tasks

I am unsure of how the evaluation for large document summarisation is conducted for the recently introduced PEGASUS model for single document summarisation. The authors show evaluation against large document datasets like Big Patent, PubMed etc…

nlp huggingface-transformers transformer-model summarization huggingface-tokenizers

asked Mar 30 '21 at 08:20 by calveeen (reputation 621; badges: 2, 10, 28)

2 votes, 1 answer

Tokenizing & encoding dataset uses too much RAM

Trying to tokenize and encode data to feed to a neural network. I only have 25GB RAM and every time I try to run the code below my Google Colab crashes. Any idea how to prevent this from happening? “Your session crashed after using all available…

python nlp pytorch huggingface-transformers huggingface-tokenizers

asked Mar 22 '21 at 14:23 by Exa (reputation 466; badges: 3, 16)

2 votes, 1 answer

Applying pre trained facebook/bart-large-cnn for text summarization in python

I am in a situation where I am working with huggingface transformers and have got some insights into it. I am working with the facebook/bart-large-cnn model to perform text summarisation for my project and I am using the following code as of now to…

python-3.x nlp huggingface-transformers summarization huggingface-tokenizers

asked Feb 25 '21 at 16:41 by Django0602 (reputation 797; badges: 7, 26)

2 votes, 1 answer

Running huggingface Bert tokenizer on GPU

I'm dealing with a huge text dataset for content classification. I've implemented the DistilBERT model and the DistilBertTokenizer.from_pretrained() tokenizer. This tokenizer is taking incredibly long to tokenize my text data, roughly 7 mins for just…

deep-learning nlp huggingface-transformers huggingface-tokenizers

asked Feb 08 '21 at 06:16 by tehem (reputation 45; badges: 1, 5)

2 votes, 1 answer

How do I get word embeddings for out of vocabulary words using a transformer model?

When I tried to get word embeddings of a sentence using bio_clinical bert, for a sentence of 8 words I am getting 11 token ids (plus start and end) because "embeddings" is an out of vocabulary word/token that is being split into em, bed, ding, s. I…

nlp huggingface-transformers transformer-model huggingface-tokenizers

asked Jan 13 '21 at 06:51 by cerofrais (reputation 1,117; badges: 1, 12, 32)
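When a word such as "embeddings" is split into several pieces, one common way to get a single vector per word is to average the vectors of its pieces. The sketch below shows that in plain Python. The function name `pool_subwords`, the 2-dimensional vectors, and the `word_ids` list are made up for illustration; the list mirrors the kind of token-to-word mapping a fast tokenizer's `word_ids()` provides, with `None` for special tokens.

```python
from typing import Dict, List, Optional


def pool_subwords(vectors: List[List[float]], word_ids: List[Optional[int]]) -> List[List[float]]:
    """Average subword vectors that belong to the same word.

    Tokens whose word id is None (special tokens) are skipped.
    Returns one vector per word, ordered by word index.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vec, wid in zip(vectors, word_ids):
        if wid is None:
            continue
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [s + v for s, v in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[s / counts[wid] for s in sums[wid]] for wid in sorted(sums)]


# Tokens: [CLS], em, ##bed, ##ding, ##s, [SEP] (toy 2-d vectors, not model output).
vectors = [[9.0, 9.0], [1.0, 0.0], [3.0, 0.0], [5.0, 4.0], [7.0, 4.0], [9.0, 9.0]]
word_ids = [None, 0, 0, 0, 0, None]
word_vectors = pool_subwords(vectors, word_ids)  # one vector for "embeddings"
```

Summing instead of averaging, or taking only the first piece, are alternative pooling choices; which one works best depends on the downstream task.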
