Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
0
votes
2 answers

How to convert words to numerics using Hugging Face, spaCy, or any Python-based workflow

I have a lot of text that contains counts written out as words, in different languages (different datasets, but each dataset uses a single language, so there is no mixing of languages). For example, "I have one apple", "I have two kids", and I want it converted to "I have 1 apple", "I have…
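For the English case, a rough sketch with the third-party word2number package (an assumption here; the question also covers other languages, which this does not handle):

    from word2number import w2n

    def words_to_digits(text):
        converted = []
        for token in text.split():
            try:
                converted.append(str(w2n.word_to_num(token)))  # "two" -> "2"
            except ValueError:
                converted.append(token)                        # keep ordinary words unchanged
        return " ".join(converted)

    print(words_to_digits("I have one apple I have two kids"))
    # -> "I have 1 apple I have 2 kids"
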
0
votes
1 answer

Is it possible to see all the token rankings for masked language modelling?

I was just wondering whether it would be possible to see all the predicted tokens for masked language modelling? Specifically, all the tokens with a low probability. For example, consider this masked language model: unmasker("I am feeling
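A minimal sketch of one way to do this, assuming a standard BERT fill-mask setup: take the logits at the [MASK] position and rank the whole vocabulary, so low-probability tokens are included too (recent versions of the fill-mask pipeline also expose a top_k argument for the head of this list):

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("I am feeling [MASK].", return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

    with torch.no_grad():
        logits = model(**inputs).logits

    probs = logits[0, mask_pos].softmax(dim=-1)   # one probability per vocabulary token
    ranked = probs.argsort(descending=True)       # full ranking, lowest-probability tokens included
    for token_id in ranked[:10].tolist():         # print as many as you like
        print(tokenizer.decode([token_id]), probs[token_id].item())
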
0
votes
1 answer

Unable to push a pre-trained model from Google Colab to Hugging Face for hosting a bot

I have trained a chatbot model in Google Colab, and when I push it to Hugging Face the push never completes: the notebook cell keeps executing and nothing is uploaded. The model is around 500 MB. !sudo apt-get install git-lfs !git config --global user.email "MY…
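A hedged sketch of the push_to_hub route, which sidesteps the manual git-lfs setup; the repo id and the DialoGPT placeholder model below are illustrative only, not the question's actual model:

    from huggingface_hub import notebook_login
    from transformers import AutoModelForCausalLM, AutoTokenizer

    notebook_login()  # paste a write token from huggingface.co/settings/tokens

    # placeholder model; substitute the chatbot model and tokenizer trained above
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

    model.push_to_hub("my-username/my-chatbot")      # placeholder repo id
    tokenizer.push_to_hub("my-username/my-chatbot")
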
0
votes
2 answers

Problem with batch_encode_plus method of tokenizer

I am encountering a strange issue with the batch_encode_plus method of the tokenizer. I recently switched from transformers version 3.3.0 to 4.5.1 (I am creating my databunch for NER). I have two sentences that I need to encode, and I have a case…
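Without the full traceback it is hard to say, but one relevant change between 3.x and 4.x is that calling the tokenizer directly is now preferred over batch_encode_plus, and the pre-tokenized flag is named is_split_into_words. A sketch of the NER-style batch call:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    sentences = [["John", "lives", "in", "Berlin"], ["Mary", "works", "at", "Google"]]
    batch = tokenizer(
        sentences,
        is_split_into_words=True,   # the inputs are already split into words
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    print(batch["input_ids"].shape)
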
0
votes
1 answer

AttributeError: type object 'Wav2Vec2ForCTC' has no attribute 'from_pretrained'

I am trying to fine-tune a Wav2Vec2 model for medical vocabulary. When I run the following code in a VS Code Jupyter notebook I get an error, but the same code works fine on Google Colab. from transformers import…
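Wav2Vec2ForCTC does provide from_pretrained, so this usually points at a stale or partial transformers install in the local VS Code environment rather than at the code itself. The expected usage, for reference:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    print(model.config.vocab_size)
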
0
votes
1 answer

DataCollatorForMultipleChoice gives KeyError: 'labels' in trainer.train

I am working on multiple-choice QA, using the official huggingface/transformers notebook, which is implemented for the SWAG dataset. I want to use it for other multiple-choice datasets, so I have added some dataset-related modifications. all…
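A hedged guess at the usual cause: the notebook's custom collator pops a "label"/"labels" field from each feature, so the new dataset's answer column has to carry that name. The dataset file and the column name "answer" below are purely illustrative:

    from datasets import load_dataset

    ds = load_dataset("csv", data_files="my_mc_dataset.csv")  # placeholder dataset
    ds = ds.rename_column("answer", "label")                  # integer index of the correct choice
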
0
votes
1 answer

Hugging Face tokenizer cannot load files properly

I am trying to train a translation model from scratch using Hugging Face's BartModel architecture, with a ByteLevelBPETokenizer for tokenization. The issue I am facing is that when I save the tokenizer after training, it is not loaded…
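A minimal sketch of the save/reload cycle, assuming the usual save_model route: it writes vocab.json and merges.txt, and both paths have to be passed back in when reconstructing the tokenizer (the corpus file name is a placeholder):

    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=["corpus.txt"], vocab_size=32000, min_frequency=2)  # placeholder corpus
    tokenizer.save_model("my_tokenizer")  # -> my_tokenizer/vocab.json, my_tokenizer/merges.txt

    reloaded = ByteLevelBPETokenizer(
        "my_tokenizer/vocab.json",
        "my_tokenizer/merges.txt",
    )
    print(reloaded.encode("hello world").tokens)
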
0
votes
0 answers

I'm facing a BrokenPipeError when trying to run sentiment analysis with Hugging Face

I'm facing a BrokenPipeError when trying to run sentiment analysis with Hugging Face. It returns [Errno 32] Broken pipe. Link to the full code: 'https://colab.research.google.com/drive/1wBXKa-gkbSPPk-o7XdwixcGk7gSHRMas?usp=sharing' The code…
0
votes
1 answer

TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

I'm having issues applying a regex to the Split() pre-tokenizer in the Hugging Face tokenizers library. The library documents the following input for Split(): pattern (str or Regex) – A pattern used to split the string. Usually a string or a…
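As the error message hints, the pre-tokenizer wants a tokenizers.Regex rather than a compiled re.Pattern; a small sketch of that substitution:

    from tokenizers import Regex
    from tokenizers.pre_tokenizers import Split

    pre_tok = Split(pattern=Regex("[A-Z]+"), behavior="isolated")
    print(pre_tok.pre_tokenize_str("helloWORLDagain"))
    # -> [('hello', (0, 5)), ('WORLD', (5, 10)), ('again', (10, 15))]
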
0
votes
2 answers

In HuggingFace tokenizers: how can I split a sequence simply on spaces?

I am using the DistilBertTokenizer from HuggingFace. I would like to tokenize my text by simply splitting it on spaces: ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] instead of the default behavior, which is like…
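A hedged sketch: replacing the fast tokenizer's pre-tokenizer with a pure whitespace split keeps words intact at that stage (WordPiece may still break the resulting pieces into subwords afterwards):

    from transformers import DistilBertTokenizerFast
    from tokenizers.pre_tokenizers import WhitespaceSplit

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    tokenizer.backend_tokenizer.pre_tokenizer = WhitespaceSplit()

    text = "Don't you love 🤗 Transformers? We sure do."
    print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
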
0
votes
1 answer

Encoding error: training BERT from scratch for Vietnamese

I am following the tutorial "How to train a new language model from scratch using Transformers and Tokenizers". In Section 2, "Train a tokenizer", after training on my own Vietnamese text data, I look at the generated .vocab file, and all the tokens look like…
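If the tokenizer in question is the tutorial's byte-level BPE, the strange-looking vocabulary entries are byte-level symbols rather than an encoding bug; a small round-trip check, with the trained file names assumed:

    from tokenizers import ByteLevelBPETokenizer

    tok = ByteLevelBPETokenizer("vocab.json", "merges.txt")  # files written during training
    enc = tok.encode("Tôi yêu tiếng Việt")
    print(enc.tokens)           # pieces look garbled because they are byte-level symbols
    print(tok.decode(enc.ids))  # decoding restores the original Vietnamese text
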
0
votes
2 answers

TFGPT2LMHeadModel unknown location

I have been playing around with TensorFlow (CPU) and some language modelling, and it has been a blast so far; everything works great. But after watching my old CPU slowly getting killed by all the model training, I decided it was time to …
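Import errors like this usually come from the new environment rather than the code: either transformers is only partially installed there or TensorFlow itself is missing, so the TF model classes cannot be loaded. A quick sanity check, assuming a standard setup:

    import tensorflow as tf
    print(tf.__version__, tf.config.list_physical_devices("GPU"))

    from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = TFGPT2LMHeadModel.from_pretrained("gpt2")
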
0
votes
0 answers

Is there a way to use GPU instead of CPU for BERT tokenization?

I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words): #creating a BERT tokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', …
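Tokenization in transformers runs on the CPU regardless; the usual speed-up is the Rust-backed fast tokenizer plus batched, multi-process mapping via datasets. A sketch, assuming the sentences sit in a plain text file (the file name is a placeholder):

    from datasets import load_dataset
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    dataset = load_dataset("text", data_files={"train": "sentences.txt"})  # placeholder file

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    dataset = dataset.map(tokenize, batched=True, num_proc=4)
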
0
votes
1 answer

Translating using pre-trained Hugging Face transformers not working

I have a situation where I am trying to use the pre-trained Hugging Face models to translate a pandas column of text from Dutch to English. My input is simple: Dutch_text Hallo, het gaat goed Hallo, ik ben niet in orde Stackoverflow…
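A minimal sketch with an OPUS-MT Dutch-to-English model applied to a pandas column; the DataFrame and column names mirror the question:

    import pandas as pd
    from transformers import pipeline

    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-nl-en")

    df = pd.DataFrame({"Dutch_text": ["Hallo, het gaat goed", "Hallo, ik ben niet in orde"]})
    results = translator(df["Dutch_text"].tolist())
    df["English_text"] = [r["translation_text"] for r in results]
    print(df)
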
0
votes
1 answer

I want to use "grouped_entities" in the huggingface pipeline for ner task, how to do that?

I want to use "grouped_entities" in the huggingface pipeline for ner task. However having issues doing that. I do look the following link on git but this did not help: https://github.com/huggingface/transformers/pull/4987