Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
4 votes, 1 answer

OSError: Can't load tokenizer

I want to train an XLNet language model from scratch. First, I trained a tokenizer as follows: from tokenizers import ByteLevelBPETokenizer # Initialize a tokenizer tokenizer = ByteLevelBPETokenizer() # Customize…
user14251114
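This OSError often means the files written by tokenizers don't match what the reloading class expects. A minimal sketch of training and reloading with the same class (file paths are placeholders):

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train on a plain-text corpus; "corpus.txt" is a hypothetical path.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# save_model() writes only vocab.json and merges.txt into the directory.
os.makedirs("my_tokenizer", exist_ok=True)
tokenizer.save_model("my_tokenizer")

# Reload with the same class; a transformers tokenizer class that expects
# different files (e.g. a SentencePiece model) fails with OSError instead.
reloaded = ByteLevelBPETokenizer("my_tokenizer/vocab.json",
                                 "my_tokenizer/merges.txt")
```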
4 votes, 1 answer

HuggingFace BERT sentiment analysis

I am getting the following error: AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., when I run classifier(encoded). My text type is…
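A likely cause, judging from classifier(encoded): the pipeline was handed tokenizer output instead of raw text, but pipelines do their own tokenization. A minimal sketch:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Pass raw strings (or a list of strings); the pipeline tokenizes internally.
print(classifier("I love this movie!"))
print(classifier(["great film", "terrible plot"]))

# Passing encoded output (a dict of input_ids / attention_mask) instead of
# text triggers the AssertionError quoted above.
```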
4 votes, 3 answers

XLNetTokenizer requires the SentencePiece library but it was not found in your environment

I am trying to implement XLNet on Google Colaboratory. But I get the following issue. ImportError: XLNetTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the installation page…
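The fix suggested by the error itself is to install the missing dependency; sentencepiece is not bundled with transformers. A sketch of what that looks like in Colab:

```python
# In a Colab cell, install the missing dependency and restart the runtime:
#   !pip install sentencepiece
# After restarting, the import that failed should work:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Hello world"))
```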
4 votes, 1 answer

Loading saved NER back into HuggingFace pipeline?

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example from their…
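One way this is commonly done (a sketch, assuming the fine-tuned NER model was saved with save_pretrained() to a hypothetical my_ner_model/ directory):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("my_ner_model")
tokenizer = AutoTokenizer.from_pretrained("my_ner_model")

# aggregation_strategy="simple" merges sub-word pieces back into entities
# (available in recent transformers versions).
ner = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))
```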
3 votes, 3 answers

How does one set the pad token correctly (not to eos) during fine-tuning to avoid the model not predicting EOS?

**tl;dr: what I really want to know is the official way to set the pad token for fine-tuning when it wasn't set during the original training, so that the model doesn't fail to learn to predict EOS.** colab:…
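One commonly used recipe (not necessarily the official answer the question asks for) is to add a dedicated pad token rather than aliasing it to eos, then resize the embeddings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))

# With a distinct pad token, padding positions (not eos) get masked out of
# the loss, so the model can still learn when to emit EOS.
```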
3 votes, 1 answer

Using a custom trained huggingface tokenizer

I’ve trained a custom tokenizer on a custom dataset using the code from the documentation. Is there a way for me to add this tokenizer to the hub and use it like the other tokenizers, by calling AutoTokenizer.from_pretrained()…
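A sketch of one way to do this: wrap the trained tokenizers object in PreTrainedTokenizerFast so it behaves like a transformers tokenizer, then push it (the file name and repo id below are placeholders):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Assumes the trained tokenizer was saved with tokenizer.save("tokenizer.json").
fast = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
fast.push_to_hub("your-username/my-tokenizer")  # needs a Hub login token

# It can then be loaded like any built-in tokenizer:
tok = AutoTokenizer.from_pretrained("your-username/my-tokenizer")
```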
3 votes, 1 answer

How to load a WordLevel Tokenizer trained with tokenizers in transformers

I would like to use the WordLevel encoding method to build my own word lists, and it saves the model with a vocab.json under the my_word2_token folder. The code is below and it works. import pandas as pd from tokenizers import decoders, models,…
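A sketch of loading such a tokenizer into transformers, assuming the full tokenizer (not just vocab.json) was saved with tokenizer.save() into the folder named in the question:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the tokenizers object, then wrap it for use with transformers.
wordlevel = Tokenizer.from_file("my_word2_token/tokenizer.json")
fast = PreTrainedTokenizerFast(tokenizer_object=wordlevel,
                               unk_token="[UNK]", pad_token="[PAD]")
print(fast("hello world")["input_ids"])
```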
3 votes, 1 answer

How to split input text into equal-size token chunks, not character lengths, and then concatenate the summarization results with Hugging Face transformers

I am using the methodology below to summarize texts longer than the 1024-token limit. The current method splits the text in half. I took this from another user's post and modified it slightly. So what I want to do is, instead of splitting in half,…
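A sketch of token-based chunking (the model name and chunk size are illustrative; the real limit depends on the model):

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(text, chunk_tokens=900):
    # Tokenize once, split the ids into fixed-size token chunks, then decode
    # each chunk back to text for the pipeline.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    pieces = [tokenizer.decode(c, skip_special_tokens=True) for c in chunks]
    summaries = summarizer(pieces, max_length=120, min_length=30, truncation=True)
    return " ".join(s["summary_text"] for s in summaries)
```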
3 votes, 1 answer

Merge multiple BatchEncodings or create a TensorFlow dataset from a list of BatchEncoding objects

In a token labelling task I am using a transformers tokenizer, which outputs objects of the BatchEncoding class. I am tokenizing each text separately because I need to extract the labels from the text and re-arrange them after tokenizing (due to…
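tokenizer.pad() can merge a list of per-text encodings into one padded batch, which can then feed tf.data. A minimal sketch:

```python
import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
texts = ["first example", "a second, longer example"]

encodings = [tokenizer(t) for t in texts]              # one BatchEncoding per text
batch = tokenizer.pad(encodings, return_tensors="tf")  # merges and pads them

dataset = tf.data.Dataset.from_tensor_slices(dict(batch))
```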
3 votes, 1 answer

resize_token_embeddings on a pretrained model with a different embedding size

I would like to ask about how to change the embedding size of a trained model. I have a trained model, models/BERT-pretrain-1-step-5000.pkl. Now I am adding a new token [TRA] to the tokeniser and trying to use resize_token_embeddings on the…
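The usual order of operations (a sketch; the pickled checkpoint in the question would first need to be loaded back into a transformers model object):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["[TRA]"])
model.resize_token_embeddings(len(tokenizer))
# The embedding matrix grows to match the new vocab size; the new rows are
# randomly initialized and should be fine-tuned before use.
```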
3 votes, 2 answers

Why does the tokeniser break down words that are present in the vocab

My understanding is that, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokeniser.get_vocab(): def checkModel(model): tokenizer =…
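A quick experiment along the lines of the question's checkModel helper (the model name and words are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()

for word in ["surgery", "surgeries"]:
    print(word, word in vocab, tokenizer.tokenize(word))

# A word can still be split even when a similar-looking string exists in the
# vocab, e.g. because of casing/normalization, or because the vocab entry is
# a continuation piece such as "##form" rather than the standalone word.
```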
3 votes, 0 answers

Create custom data_collator for Huggingface Trainer

I need to create a custom data_collator for fine-tuning with the Hugging Face Trainer API. Hugging Face offers DataCollatorForWholeWordMask for masking whole words within sentences with a given probability. model_ckpt =…
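For reference, any callable mapping a list of examples to a batch dict can serve as a collator; a minimal sketch (the custom masking logic is left as a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def my_collator(examples):
    # examples: list of dicts produced by the tokenizer ("input_ids", ...)
    batch = tokenizer.pad(examples, return_tensors="pt")
    # Custom whole-word-masking logic would go here; as a placeholder the
    # labels simply mirror the inputs.
    batch["labels"] = batch["input_ids"].clone()
    return batch

# trainer = Trainer(model=model, data_collator=my_collator, ...)
```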
3 votes, 1 answer

TypeError: not a string | parameters in AutoTokenizer.from_pretrained()

Goal: Amend this Notebook to work with the albert-base-v2 model. Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory. In order to evaluate and export this quantised model, I need to set up a…
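For what it's worth, this particular TypeError is often raised by SentencePiece when the tokenizer is pointed at a directory missing its spiece.model file, so the vocab path ends up as None. A sketch of the two loading styles:

```python
from transformers import AutoTokenizer

# Loading by model id downloads all required files, including spiece.model:
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# When loading from a local path instead, save_pretrained() should have
# written every file the tokenizer needs into that directory first:
tokenizer.save_pretrained("local_albert")
local = AutoTokenizer.from_pretrained("local_albert")
```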
3 votes, 1 answer

HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer

Goal: Amend this Notebook to work with the albert-base-v2 model. The error occurs in Section 1.3. Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory. There are 3 listed ways this error can be caused. I'm not…
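Of the three causes listed in the message, the most common for albert-base-v2 is a missing sentencepiece package, which transformers needs in order to build the fast tokenizer from the SentencePiece model. A sketch:

```python
# In the notebook, install the dependency and restart the kernel:
#   !pip install sentencepiece
# then the fast tokenizer can be instantiated:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
print(type(tokenizer).__name__)  # AlbertTokenizerFast once the backend loads
```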
3 votes, 1 answer

How to avoid a Hugging Face T5-based seq-to-seq model suddenly reaching a loss of `nan` and starting to predict only ``?

I'm trying to train a T5-based LM-head model (mrm8488/t5-base-finetuned-wikiSQL) using my custom data to turn text into SQL (based roughly on the SPIDER dataset). The current training loop I have is something like this: parameters =…
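Some generic mitigations for a loss that suddenly goes to nan (not specific to this question's data): clip gradients, keep the learning rate small, and skip non-finite steps. A sketch using the model id from the question with placeholder data:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "mrm8488/t5-base-finetuned-wikiSQL"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR

batch = tokenizer(["translate English to SQL: list all users"],
                  return_tensors="pt")
labels = tokenizer(["SELECT * FROM users"], return_tensors="pt").input_ids

loss = model(**batch, labels=labels).loss
if torch.isfinite(loss):                       # skip steps that went nan/inf
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
optimizer.zero_grad()
```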