Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
4 votes · 1 answer
OSError: Can't load tokenizer
I want to train an XLNet language model from scratch. First, I have trained a tokenizer as follows:
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize…
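
For context, a minimal sketch of how such a tokenizer is typically trained and saved so that from_pretrained() can find its files ("data.txt" and "my_tokenizer" are placeholder paths, not from the question):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# Train on a plain-text corpus; the file name here is a placeholder.
tokenizer.train(files=["data.txt"], vocab_size=32000, min_frequency=2)
# Writes vocab.json and merges.txt, the files from_pretrained() looks for.
tokenizer.save_model("my_tokenizer")

Note that XLNet's own tokenizer is SentencePiece-based and expects a spiece.model file, so pointing XLNetTokenizer.from_pretrained() at a folder that only holds vocab.json/merges.txt is one way to get "OSError: Can't load tokenizer".
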
user14251114
4 votes · 1 answer
HuggingFace BERT sentiment analysis
I am getting the following error when I run classifier(encoded):
AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
My text type is…
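
The assertion usually means the pipeline was handed pre-encoded input. A minimal sketch of the expected usage, assuming a default sentiment model (the variable names are illustrative):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
text = "I love this movie!"  # a raw str, or List[str] for a batch

# Wrong: classifier(tokenizer(text)) -- the pipeline tokenizes internally,
# so passing an already-encoded dict triggers the AssertionError above.
print(classifier(text))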

paris (rep 43)
4 votes · 3 answers
XLNetTokenizer requires the SentencePiece library but it was not found in your environment
I am trying to implement XLNet on Google Colaboratory, but I get the following issue.
ImportError:
XLNetTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page…
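
The usual fix is simply to install the missing dependency and restart the Colab runtime; a minimal sketch (the model id is the standard pretrained checkpoint, used here for illustration):

# In a notebook cell first: !pip install sentencepiece
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Hello world"))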

Ashok Kumar Jayaraman (rep 2,887)
4 votes · 1 answer
Loading saved NER back into HuggingFace pipeline?
I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example off of their…
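
A minimal sketch of the usual reload pattern, assuming the fine-tuned model and tokenizer were saved with save_pretrained() to a hypothetical "./my-ner-model" folder:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("./my-ner-model")
tokenizer = AutoTokenizer.from_pretrained("./my-ner-model")
# aggregation_strategy="simple" (recent transformers versions) merges
# sub-word pieces back into whole entities.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))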

rmahesh (rep 739)
3 votes · 3 answers
How does one set the pad token correctly (not to eos) during fine-tuning to avoid model not predicting EOS?
**tl;dr: what is the official way to set the pad token for fine-tuning when it wasn't set during the original training, so that the model doesn't fail to learn to predict EOS?**
colab:…
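
One commonly recommended approach (a sketch, not necessarily the single official way) is to register a distinct [PAD] token rather than reusing EOS, then resize the embeddings:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # distinct from eos
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

With pad distinct from eos, padding positions can be masked out of the loss without also masking real end-of-sequence tokens, so the model still learns to predict EOS.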

Charlie Parker (rep 5,884)
3 votes · 1 answer
Using a custom trained huggingface tokenizer
I’ve trained a custom tokenizer on a custom dataset using the code from the documentation. Is there a way for me to add this tokenizer to the Hub and use it like the other tokenizers, by calling AutoTokenizer.from_pretrained()…
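
A minimal sketch of one way to do this, assuming the trained tokenizer was saved to tokenizer.json and that "your-username/my-custom-tokenizer" is a placeholder repo id you are logged in to push to (huggingface-cli login):

from transformers import PreTrainedTokenizerFast

wrapped = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
wrapped.push_to_hub("your-username/my-custom-tokenizer")

# Later, from anywhere:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("your-username/my-custom-tokenizer")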

Dagim Ashenafi (rep 33)
3 votes · 1 answer
How to load a WordLevel Tokenizer trained with tokenizers in transformers
I would like to use the WordLevel encoding method to build my own word lists; it saves the model with a vocab.json under the my_word2_token folder. The code is below and it works.
import pandas as pd
from tokenizers import decoders, models,…
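
A minimal sketch of the usual bridge, assuming the whole tokenizer was saved to a single tokenizer.json (the special tokens below are assumptions; adjust them to your own):

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tok = Tokenizer.from_file("my_word2_token/tokenizer.json")
hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok,   # or tokenizer_file="..."
    unk_token="[UNK]",
    pad_token="[PAD]",
)
print(hf_tok("hello world"))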

VictorZhu (rep 31)
3 votes · 1 answer
How to split input text into equal-size token chunks (not character length) and then concatenate the summarization results with Hugging Face transformers
I am using the methodology below to summarize texts longer than the 1024-token limit.
The current method splits the text in half. I took this from another user's post and modified it slightly.
So what I want to do, instead of splitting in half,…
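
A minimal sketch of token-based (rather than character-based) chunking, assuming a hypothetical summarizer pipeline and its tokenizer are passed in:

def summarize_long(text, summarizer, tokenizer, max_tokens=1024):
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Slice the token ids into equal-size windows, not the raw string.
    chunks = [ids[i:i + max_tokens] for i in range(0, len(ids), max_tokens)]
    pieces = [tokenizer.decode(c, skip_special_tokens=True) for c in chunks]
    summaries = [summarizer(p)[0]["summary_text"] for p in pieces]
    return " ".join(summaries)  # concatenate the per-chunk summaries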

Furkan Gözükara (rep 22,964)
3 votes · 1 answer
Merge multiple BatchEncoding or create tensorflow dataset from list of BatchEncoding objects
In a token labelling task I am using a transformers tokenizer, which outputs objects of the BatchEncoding class.
I am tokenizing each text separately because I need to extract the labels from the text and re-arrange them after tokenizing (due to…
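
A minimal sketch of one workaround, assuming every encoding fits within a common max_length so the padded tensors stack cleanly (the helper name is illustrative):

import tensorflow as tf

def merge_encodings(encodings, tokenizer, max_length=128):
    # Pad each BatchEncoding to the same length; inputs are assumed to be
    # no longer than max_length, since pad() does not truncate.
    padded = [tokenizer.pad(e, padding="max_length", max_length=max_length)
              for e in encodings]
    # Collapse the list of BatchEncoding objects into one dict of lists.
    batch = {k: [p[k] for p in padded] for k in padded[0].keys()}
    return tf.data.Dataset.from_tensor_slices(batch)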

raquelhortab (rep 430)
3 votes · 1 answer
resize_token_embeddings on a pretrained model with a different embedding size
I would like to ask about how to change the embedding size of a trained model.
I have a trained model, models/BERT-pretrain-1-step-5000.pkl.
Now I am adding a new token [TRA] to the tokenizer and trying to use resize_token_embeddings to the…
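
A minimal sketch of the standard order of operations (the stock BERT checkpoint is used for illustration; the [TRA] token comes from the question):

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["[TRA]"])                # grow the vocab first
model.resize_token_embeddings(len(tokenizer))  # then match the embedding matrix

The new embedding rows are randomly initialized; everything already in the checkpoint keeps its trained weights.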

tw0930 (rep 61)
3 votes · 2 answers
Why does the tokeniser break down words that are present in the vocab
In my understanding, given a word, the tokeniser breaks it down into sub-words only if the word is not present in tokeniser.get_vocab():
def checkModel(model):
    tokenizer =…
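
A minimal sketch of why this happens with byte-level BPE vocabularies: entries carry a leading-space marker (Ġ), so the bare word and its in-vocab form are different strings (roberta-base is used here for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()
print("tokens" in vocab)              # may be False as a bare string
print("\u0120tokens" in vocab)        # "Ġtokens", the space-prefixed form
print(tokenizer.tokenize(" tokens"))  # stays whole: ['Ġtokens']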

LSM (rep 85)
3 votes · 0 answers
Create custom data_collator for Huggingface Trainer
I need to create a custom data_collator for fine-tuning with the Huggingface Trainer API.
HuggingFace offers DataCollatorForWholeWordMask for masking whole words within the sentences with a given probability.
model_ckpt =…
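
A minimal sketch of one way to customize it: subclass the existing collator and post-process the batch it builds (the class name is illustrative):

from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class MyCollator(DataCollatorForWholeWordMask):
    def __call__(self, features):
        batch = super().__call__(features)
        # Inspect or modify batch["input_ids"] / batch["labels"] here.
        return batch

data_collator = MyCollator(tokenizer=tokenizer, mlm_probability=0.15)
# Then pass it to Trainer(..., data_collator=data_collator).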

kkgarg (rep 1,246)
3 votes · 1 answer
TypeError: not a string | parameters in AutoTokenizer.from_pretrained()
Goal: Amend this Notebook to work with the albert-base-v2 model.
Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory.
In order to evaluate and export this quantised model, I need to set up a…
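
One known cause (a sketch, not a diagnosis of this exact notebook): SentencePiece raises "not a string" when it is handed None as a vocab file, which happens when from_pretrained() points at a folder that lacks the tokenizer files such as spiece.model:

from transformers import AutoTokenizer

# Works: the hub id resolves to a repo containing spiece.model.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# Can fail with TypeError: not a string if "./model_dir" holds only model
# weights and no tokenizer files:
# tokenizer = AutoTokenizer.from_pretrained("./model_dir")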

DanielBell99 (rep 896)
3 votes · 1 answer
HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer
Goal: Amend this Notebook to work with the albert-base-v2 model.
Error occurs in Section 1.3.
Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory.
There are 3 listed ways this error can be caused. I'm not…
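
The most common of those causes in practice is a missing sentencepiece install: the fast ALBERT tokenizer is converted from a SentencePiece model. A minimal sketch of the fix (restart the kernel after installing):

# In a notebook cell first: !pip install sentencepiece
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=True)
print(tokenizer.tokenize("Hello world"))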

DanielBell99 (rep 896)
3 votes · 1 answer
How to avoid huggingface t5-based seq to seq suddenly reaching a loss of `nan` and start predicting only ``?
I'm trying to train a T5-based LM head model (mrm8488/t5-base-finetuned-wikiSQL) using my custom data to turn text into SQL (based roughly on the SPIDER dataset).
The current training loop I have is something like this:
parameters =…
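
A minimal sketch of the usual mitigations (T5 is known to overflow in fp16, so the model is kept in fp32 here; the training pair is a toy placeholder, not from the question):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "mrm8488/t5-base-finetuned-wikiSQL"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)  # fp32 by default
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # modest lr

batch = tokenizer("translate English to SQL: list all users", return_tensors="pt")
labels = tokenizer("SELECT * FROM users", return_tensors="pt").input_ids

loss = model(**batch, labels=labels).loss
loss.backward()
# Clipping keeps a single bad batch from blowing the weights up into nan.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()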

George (rep 3,521)