Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
451 questions
0 votes
2 answers
How to convert number words to numerals using Hugging Face, spaCy, or any Python-based workflow
I have a lot of text that contains numbers written out as words, in different languages (different datasets, but each dataset is in a single language, so there is no mixing of languages). For example:
I have one apple
I have two kids
and
I want it converted to:
I have 1 apple
I have…

ML85
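For the English case, a small lookup table of number words plus a whole-word regex substitution is often enough (the `word2number` package or spaCy's `like_num` token attribute are fuller options). A minimal pure-Python sketch; the per-language word tables are assumed to be supplied by you:

```python
import re

# Minimal English word-to-digit table; other languages need their own tables
NUM_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

def words_to_digits(text: str) -> str:
    # \b ensures whole-word matches only, so "someone" is left alone
    pattern = re.compile(r"\b(" + "|".join(NUM_WORDS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: NUM_WORDS[m.group(0).lower()], text)

print(words_to_digits("I have one apple"))  # I have 1 apple
print(words_to_digits("I have two kids"))   # I have 2 kids
```

This handles single number words; compound numbers ("twenty one") need a proper parser such as `word2number`.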
0 votes
1 answer
Is it possible to see all the token rankings for masked language modelling?
I was wondering whether it is possible to see all the predicted tokens for masked language modelling, specifically all the tokens with a low probability.
For example, consider this masked language model:
unmasker("I am feeling …
user14946125
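In recent transformers versions the fill-mask pipeline accepts a `top_k` argument (older releases spelled it `topk`), so passing a value up to the vocabulary size returns every token's score. Under the hood this is just a softmax over the logits at the masked position followed by a sort; a pure-Python sketch of that ranking step, using a toy vocabulary and invented logits in place of real model output:

```python
import math

# Toy vocabulary and made-up logits standing in for the model's
# output at the [MASK] position
vocab = ["happy", "sad", "tired", "great", "banana"]
logits = [3.2, 1.1, 0.7, 2.9, -4.0]

# Softmax turns logits into probabilities
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Rank every token, including the low-probability ones
ranking = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, prob in ranking:
    print(f"{token}: {prob:.4f}")
```

With a real model, the same effect comes from taking the logits at the mask index and sorting the full vocabulary by probability.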
0 votes
1 answer
Unable to push pre-trained model from Google Colab to Hugging Face for hosting a bot
I have trained a chatbot model in Google Colab, and when pushing to Hugging Face the push never completes: the notebook cell keeps executing. The model size is around 500 MB.
!sudo apt-get install git-lfs
!git config --global user.email "MY…
0 votes
2 answers
Problem with batch_encode_plus method of tokenizer
I am encountering a strange issue with the batch_encode_plus method of the tokenizer. I recently switched from transformers version 3.3.0 to 4.5.1 (I am creating my databunch for NER).
I have two sentences that I need to encode, and I have a case…

Anurag Sharma
0 votes
1 answer
AttributeError: type object 'Wav2Vec2ForCTC' has no attribute 'from_pretrained'
I am trying to fine-tune a Wav2Vec2 model for medical vocabulary. When I run the following code in my VS Code Jupyter notebook I get an error, but when I run the same thing on Google Colab it works fine.
from transformers import…

Ayush Mehta
0 votes
1 answer
DataCollatorForMultipleChoice gives KeyError: 'labels' in trainer.train
I am working on multiple-choice QA, using the official huggingface/transformers notebook implemented for the SWAG dataset.
I want to use it for other multiple-choice datasets, so I have added some modifications related to the dataset. all…

programming123
0 votes
1 answer
Hugging face tokenizer cannot load files properly
I am trying to train a translation model from scratch using HuggingFace's BartModel architecture, with a ByteLevelBPETokenizer for tokenization.
The issue I am facing is that when I save the tokenizer after training, it is not loaded…

Vaibhav Agrawal
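A common cause of this kind of reload failure (an assumption about the asker's setup, since the excerpt is truncated) is saving only part of the tokenizer's state: a ByteLevelBPETokenizer needs both its vocab.json and merges.txt to be reconstructed. A minimal round-trip sketch, assuming the tokenizers package and a tiny in-memory corpus:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Train a tiny tokenizer on an in-memory corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(["hello world", "hello there"] * 50,
                              vocab_size=300, min_frequency=1)

# save_model writes vocab.json and merges.txt; both are needed to reload
out_dir = tempfile.mkdtemp()
tokenizer.save_model(out_dir)

reloaded = ByteLevelBPETokenizer(os.path.join(out_dir, "vocab.json"),
                                 os.path.join(out_dir, "merges.txt"))
print(reloaded.encode("hello world").tokens)
```

The same round trip also works through a single JSON file via `tokenizer.save(path)` and `Tokenizer.from_file(path)`.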
0 votes
0 answers
I'm facing a BrokenPipeError when trying to run sentiment analysis with Hugging Face
I'm facing a BrokenPipeError when trying to run sentiment analysis with Hugging Face; it returns [Errno 32] Broken pipe.
Link with the full code: https://colab.research.google.com/drive/1wBXKa-gkbSPPk-o7XdwixcGk7gSHRMas?usp=sharing
The code…

Nithin Reddy
0 votes
1 answer
TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]
I'm having issues applying a regex pattern to the Split() operation in the HuggingFace tokenizers library. The library documents the following input for Split():
pattern (str or Regex) – A pattern used to split the string. Usually a
string or a…

Jamie Dimon
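The error message itself points at the fix: Split() accepts a plain string or a tokenizers.Regex, not a compiled re.Pattern, so the pattern has to be wrapped in the library's own Regex type. A short sketch, assuming the tokenizers package is installed:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# re.compile("[A-Z]+") is rejected by Split(); wrap the pattern
# string in tokenizers.Regex instead
splitter = Split(Regex("[A-Z]+"), behavior="isolated")

# behavior="isolated" keeps each regex match as its own piece
print(splitter.pre_tokenize_str("fooBARbaz"))
```

`pre_tokenize_str` returns (piece, offsets) pairs; other `behavior` values ("removed", "merged_with_previous", …) control what happens to the matched delimiters.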
0 votes
2 answers
In HuggingFace tokenizers: how can I split a sequence simply on spaces?
I am using the DistilBertTokenizer tokenizer from HuggingFace.
I would like to tokenize my text by simply splitting it on spaces:
["Don't", "you", "love", "", "Transformers?", "We", "sure", "do."]
instead of the default behavior, which is like…

Taras Kucherenko
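The desired list, including the empty string produced by the double space, is exactly what Python's `str.split(" ")` gives; inside the tokenizers library, the WhitespaceSplit pre-tokenizer behaves similarly, though it collapses runs of whitespace. A plain-Python sketch:

```python
text = "Don't you love  Transformers? We sure do."

# split(" ") keeps empty strings for consecutive spaces,
# unlike split() with no argument, which collapses them
tokens = text.split(" ")
print(tokens)
# ["Don't", 'you', 'love', '', 'Transformers?', 'We', 'sure', 'do.']
```

Note that a plain split only pre-tokenizes; DistilBERT's WordPiece vocabulary still has to be applied afterwards if the IDs are meant to feed the model.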
0 votes
1 answer
Encoding error: Train BERT from scratch in Vietnamese language
I followed the tutorial How to train a new language model from scratch using Transformers and Tokenizers.
In Section 2, Train a tokenizer, after training on my own Vietnamese text data, I looked at the generated .vocab file: all the tokens have become like…

save_ole
0 votes
2 answers
TFGPT2LMHeadModel unknown location
I have been playing around with TensorFlow (CPU) and some language modelling, and it has been a blast so far; everything works great.
But after watching my old CPU slowly getting killed by all the model training, I decided it was time to …

Magnus V.
0 votes
0 answers
Is there a way to use GPU instead of CPU for BERT tokenization?
I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):
# creating a BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
…

Vincent Teyssier
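Tokenization runs on the CPU; BertTokenizer has no GPU path. The usual speedups are switching to the Rust-backed BertTokenizerFast, feeding batches of lines instead of single lines, and parallelizing, since the Rust code releases the GIL. A sketch of the batching-plus-threading pattern with a stand-in tokenize function (the real tokenizer needs a downloaded vocabulary):

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize_batch(lines):
    # Stand-in for tokenizer(lines, truncation=True, padding=True);
    # a fast tokenizer releases the GIL, so threads give real parallelism
    return [line.lower().split() for line in lines]

lines = ["Hello world", "BERT tokenization is CPU bound"] * 1000
batches = [lines[i:i + 256] for i in range(0, len(lines), 256)]

with ThreadPoolExecutor(max_workers=4) as pool:
    tokenized = [tokens for batch in pool.map(tokenize_batch, batches)
                 for tokens in batch]
print(len(tokenized))  # 2000
```

For a 2.3M-line corpus this batched, parallel approach with a fast tokenizer is typically far quicker than looping line by line.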
0 votes
1 answer
Translating using pre-trained Hugging Face transformers not working
I am trying to use the pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple:
Dutch_text
Hallo, het gaat goed
Hallo, ik ben niet in orde
Stackoverflow…

Django0602
0 votes
1 answer
I want to use "grouped_entities" in the huggingface pipeline for the NER task; how can I do that?
I want to use "grouped_entities" in the huggingface pipeline for the NER task, but I am having issues doing that.
I looked at the following pull request on GitHub, but it did not help:
https://github.com/huggingface/transformers/pull/4987

Abhishek Bisht
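The flag is passed when the pipeline is constructed, e.g. `pipeline("ner", grouped_entities=True)` (newer transformers versions rename it `aggregation_strategy`). What it does is merge consecutive B-/I- tagged pieces of the same entity type into one span; a pure-Python sketch of that grouping logic, with invented tags for illustration:

```python
def group_entities(tagged_tokens):
    """Merge consecutive (word, tag) pairs that continue the same entity."""
    groups = []
    for word, tag in tagged_tokens:
        entity_type = tag.split("-", 1)[-1]
        if groups and groups[-1][1] == entity_type and tag.startswith("I-"):
            groups[-1][0].append(word)  # continuation of the current entity
        else:
            groups.append([[word], entity_type])  # start a new entity
    return [(" ".join(words), etype) for words, etype in groups]

tags = [("Hugging", "B-ORG"), ("Face", "I-ORG"), ("Inc", "I-ORG"),
        ("New", "B-LOC"), ("York", "I-LOC")]
print(group_entities(tags))
# [('Hugging Face Inc', 'ORG'), ('New York', 'LOC')]
```

The real pipeline additionally merges WordPiece sub-tokens and averages their scores, but the B-/I- merging above is the core of what `grouped_entities=True` changes in the output.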