Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
4 votes, 1 answer

OSError: Can't load tokenizer

I want to train an XLNet language model from scratch. First, I trained a tokenizer as follows: from tokenizers import ByteLevelBPETokenizer # Initialize a tokenizer tokenizer = ByteLevelBPETokenizer() # Customize…
user14251114
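This OSError often means the files written by tokenizers don't match what the reloading class expects. A minimal sketch of training and reloading with the same class (file paths are placeholders):

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train on a plain-text corpus; "corpus.txt" is a hypothetical path.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# save_model() writes only vocab.json and merges.txt into the directory.
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")

# Reload with the same class; a transformers tokenizer class that expects
# different files (e.g. a SentencePiece model) fails with OSError instead.
reloaded = ByteLevelBPETokenizer("my_tokenizer/vocab.json",
                                 "my_tokenizer/merges.txt")
```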
4 votes, 1 answer

HuggingFace BERT sentiment analysis

I am getting the following error: AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., when I run classifier(encoded). My text type is…
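A likely cause, judging from classifier(encoded): the pipeline was handed tokenizer output instead of raw text, but pipelines do their own tokenization. A minimal sketch:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Pass raw strings (or a list of strings); the pipeline tokenizes internally.
print(classifier("I love this movie!"))
print(classifier(["great film", "terrible plot"]))

# Passing encoded output (a dict of input_ids / attention_mask) instead of
# text triggers the AssertionError quoted above.
```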
4 votes, 3 answers

XLNetTokenizer requires the SentencePiece library but it was not found in your environment

I am trying to implement XLNet on Google Colaboratory. But I get the following issue. ImportError: XLNetTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the installation page…
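The fix suggested by the error itself is to install the missing dependency; sentencepiece is not bundled with transformers. A sketch of what that looks like in Colab:

```python
# In a Colab cell, install the missing dependency and restart the runtime:
#   !pip install sentencepiece
# After restarting, the import that failed should work:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Hello world"))
```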
4 votes, 1 answer

Loading saved NER back into HuggingFace pipeline?

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example from their…
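One way this is commonly done (a sketch, assuming the fine-tuned NER model was saved with save_pretrained() to a hypothetical my_ner_model/ directory):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("my_ner_model")
tokenizer = AutoTokenizer.from_pretrained("my_ner_model")

# aggregation_strategy="simple" merges sub-word pieces back into entities
# (available in recent transformers versions).
ner = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))
```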
3 votes, 3 answers

How does one set the pad token correctly (not to eos) during fine-tuning to avoid the model not predicting EOS?

**tl;dr: what I really want to know is the official way to set the pad token for fine-tuning when it wasn't set during the original training, so that the model doesn't fail to learn to predict EOS.** colab:…
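One commonly used recipe (not necessarily the official answer the question asks for) is to add a dedicated pad token rather than aliasing it to eos, then resize the embeddings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))

# With a distinct pad token, padding positions (not eos) get masked out of
# the loss, so the model can still learn when to emit EOS.
```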
3 votes, 1 answer

Using a custom trained huggingface tokenizer

I’ve trained a custom tokenizer on a custom dataset using the code from the documentation. Is there a way for me to add this tokenizer to the hub and use it like the other tokenizers, by calling AutoTokenizer.from_pretrained()…
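A sketch of one way to do this: wrap the trained tokenizers object in PreTrainedTokenizerFast so it behaves like a transformers tokenizer, then push it (the file name and repo id below are placeholders):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Assumes the trained tokenizer was saved with tokenizer.save("tokenizer.json").
fast = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
fast.push_to_hub("your-username/my-tokenizer")  # needs a Hub login token

# It can then be loaded like any built-in tokenizer:
tok = AutoTokenizer.from_pretrained("your-username/my-tokenizer")
```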
3 votes, 1 answer

How to load a WordLevel Tokenizer trained with tokenizers in transformers

I would like to use the WordLevel encoding method to build my own word lists, and it saves the model with a vocab.json under the my_word2_token folder. The code is below and it works. import pandas as pd from tokenizers import decoders, models,…
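A sketch of loading such a tokenizer into transformers, assuming the full tokenizer (not just vocab.json) was saved with tokenizer.save() into the folder named in the question:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the tokenizers object, then wrap it for use with transformers.
wordlevel = Tokenizer.from_file("my_word2_token/tokenizer.json")
fast = PreTrainedTokenizerFast(tokenizer_object=wordlevel,
                               unk_token="[UNK]", pad_token="[PAD]")
print(fast("hello world")["input_ids"])
```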
3 votes, 1 answer

How to split input text into equal-size token chunks, not character lengths, and then concatenate the summarization results with Hugging Face transformers

I am using the methodology below to summarize texts longer than the 1024-token limit. The current method splits the text in half. I took this from another user's post and modified it slightly. So what I want to do is, instead of splitting in half,…
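A sketch of token-based chunking (the model name and chunk size are illustrative; the real limit depends on the model):

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(text, chunk_tokens=900):
    # Tokenize once, split the ids into fixed-size token chunks, then decode
    # each chunk back to text for the pipeline.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    pieces = [tokenizer.decode(c, skip_special_tokens=True) for c in chunks]
    summaries = summarizer(pieces, max_length=120, min_length=30, truncation=True)
    return " ".join(s["summary_text"] for s in summaries)
```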
3 votes, 1 answer

Merge multiple BatchEncodings or create a TensorFlow dataset from a list of BatchEncoding objects

In a token labelling task I am using a transformers tokenizer, which outputs objects of the BatchEncoding class. I am tokenizing each text separately because I need to extract the labels from the text and re-arrange them after tokenizing (due to…
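tokenizer.pad() can merge a list of per-text encodings into one padded batch, which can then feed tf.data. A minimal sketch:

```python
import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
texts = ["first example", "a second, longer example"]

encodings = [tokenizer(t) for t in texts]              # one BatchEncoding per text
batch = tokenizer.pad(encodings, return_tensors="tf")  # merges and pads them

dataset = tf.data.Dataset.from_tensor_slices(dict(batch))
```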
3 votes, 1 answer

resize_token_embeddings on a pretrained model with a different embedding size

I would like to ask about how to change the embedding size of a trained model. I have a trained model, models/BERT-pretrain-1-step-5000.pkl. Now I am adding a new token [TRA] to the tokeniser and trying to use resize_token_embeddings on the…
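The usual order of operations (a sketch; the pickled checkpoint in the question would first need to be loaded back into a transformers model object):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["[TRA]"])
model.resize_token_embeddings(len(tokenizer))
# The embedding matrix grows to match the new vocab size; the new rows are
# randomly initialized and should be fine-tuned before use.
```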
3 votes, 2 answers

Why does the tokeniser break down words that are present in the vocab

My understanding is that, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokeniser.get_vocab(): def checkModel(model): tokenizer =…
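A quick experiment along the lines of the question's checkModel helper (the model name and words are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()

for word in ["surgery", "surgeries"]:
    print(word, word in vocab, tokenizer.tokenize(word))

# A word can still be split even when a similar-looking string exists in the
# vocab, e.g. because of casing/normalization, or because the vocab entry is
# a continuation piece such as "##form" rather than the standalone word.
```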
3 votes, 0 answers

Create custom data_collator for Huggingface Trainer

I need to create a custom data_collator for fine-tuning with the Hugging Face Trainer API. Hugging Face offers DataCollatorForWholeWordMask for masking whole words within sentences with a given probability. model_ckpt =…
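For reference, any callable mapping a list of examples to a batch dict can serve as a collator; a minimal sketch (the custom masking logic is left as a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def my_collator(examples):
    # examples: list of dicts produced by the tokenizer ("input_ids", ...)
    batch = tokenizer.pad(examples, return_tensors="pt")
    # Custom whole-word-masking logic would go here; as a placeholder the
    # labels simply mirror the inputs.
    batch["labels"] = batch["input_ids"].clone()
    return batch

# trainer = Trainer(model=model, data_collator=my_collator, ...)
```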
3 votes, 1 answer

TypeError: not a string | parameters in AutoTokenizer.from_pretrained()

Goal: Amend this Notebook to work with the albert-base-v2 model. Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory. In order to evaluate and export this quantised model, I need to set up a…
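For what it's worth, this particular TypeError is often raised by SentencePiece when the tokenizer is pointed at a directory missing its spiece.model file, so the vocab path ends up as None. A sketch of the two loading styles:

```python
from transformers import AutoTokenizer

# Loading by model id downloads all required files, including spiece.model:
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# When loading from a local path instead, save_pretrained() should have
# written every file the tokenizer needs into that directory first:
tokenizer.save_pretrained("local_albert")
local = AutoTokenizer.from_pretrained("local_albert")
```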
3 votes, 1 answer

HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer

Goal: Amend this Notebook to work with the albert-base-v2 model. The error occurs in Section 1.3. Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory. There are 3 listed ways this error can be caused. I'm not…
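Of the three causes listed in the message, the most common for albert-base-v2 is a missing sentencepiece package, which transformers needs in order to build the fast tokenizer from the SentencePiece model. A sketch:

```python
# In the notebook, install the dependency and restart the kernel:
#   !pip install sentencepiece
# then the fast tokenizer can be instantiated:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
print(type(tokenizer).__name__)  # AlbertTokenizerFast once the backend loads
```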
3 votes, 1 answer

How to avoid a Hugging Face T5-based seq-to-seq model suddenly reaching a loss of `nan` and starting to predict only ``?

I'm trying to train a T5-based LM-head model (mrm8488/t5-base-finetuned-wikiSQL) using my custom data to turn text into SQL (based roughly on the SPIDER dataset). The current training loop I have is something like this: parameters =…
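Some generic mitigations for a loss that suddenly goes to nan (not specific to this question's data): clip gradients, keep the learning rate small, and skip non-finite steps. A sketch using the model id from the question with placeholder data:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "mrm8488/t5-base-finetuned-wikiSQL"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR

batch = tokenizer(["translate English to SQL: list all users"],
                  return_tensors="pt")
labels = tokenizer(["SELECT * FROM users"], return_tensors="pt").input_ids

loss = model(**batch, labels=labels).loss
if torch.isfinite(loss):                       # skip steps that went nan/inf
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
optimizer.zero_grad()
```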