Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

451 questions
2 votes · 1 answer

Parsing the Hugging Face Transformer Output

I am looking to use the bert-english-uncased-finetuned-pos transformer, mentioned here: https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos?text=My+name+is+Clara+and+I+live+in+Berkeley%2C+California. I am querying the transformer this…
red-devil · 1,064 reputation · 1 gold · 20 silver · 34 bronze
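
A minimal sketch of one way to query this model and parse the output, using the token-classification pipeline (the model id is taken from the question's link):

```python
# A sketch: run the POS model through the token-classification pipeline
# and read the per-token predictions from the returned list of dicts.
from transformers import pipeline

pos = pipeline("token-classification",
               model="vblagoje/bert-english-uncased-finetuned-pos")

for token in pos("My name is Clara and I live in Berkeley, California."):
    # each dict carries the token text, its POS tag, and a confidence score
    print(token["word"], token["entity"], round(token["score"], 3))
```
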
2 votes · 0 answers

Finetune TFBertForMaskedLM model.fit() ValueError

The problem: I have been trying to train a TFBertForMaskedLM model with TensorFlow, but when I use model.fit() I always encounter an error. I hope someone can help and propose a solution. Reference paper and sample output: the paper title is…
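
A minimal sketch of one way model.fit() can work with TFBertForMaskedLM, assuming a recent transformers release: placing "labels" inside the feature dict lets the model compute its internal MLM loss, so compile() needs no explicit loss.

```python
# A sketch: the model computes its own masked-LM loss when "labels" is
# present in the inputs, so compile() is called without a loss argument.
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer(["The quick brown fox [MASK] over the lazy dog."],
                padding="max_length", max_length=16, return_tensors="np")
features = dict(enc)
features["labels"] = enc["input_ids"].copy()  # -100 entries would be ignored

dataset = tf.data.Dataset.from_tensor_slices(features).batch(1)
model.compile(optimizer=tf.keras.optimizers.Adam(5e-5))
model.fit(dataset, epochs=1)
```
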
2 votes · 0 answers

Iterating through Huggingface tokenizer with remainder

Transformer models have maximum token limits. If I want to substring my text to fit within that limit, what is the generally accepted way? Due to the treatment of special characters, it isn't the case that the tokenizer maps its tokens to something…
Mittenchops · 18,633 reputation · 33 gold · 128 silver · 246 bronze
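
One commonly used approach (a sketch, assuming a fast tokenizer): let return_overflowing_tokens split the text at the token level, so the remainder comes back as extra chunks.

```python
# A sketch: the fast tokenizer splits long input into max_length-sized
# chunks and returns the remainder as additional sequences.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "some very long document " * 1000   # placeholder input

enc = tokenizer(long_text,
                max_length=512,
                truncation=True,
                return_overflowing_tokens=True,
                stride=0)                        # stride > 0 overlaps chunks

for ids in enc["input_ids"]:
    chunk = tokenizer.decode(ids, skip_special_tokens=True)
    # feed each chunk to the model separately
```
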
2 votes · 3 answers

Any reason to save a pretrained BERT tokenizer?

Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer.encode(). I have seen in most places that people…
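
The usual reason is reproducibility: saving keeps the tokenizer version pinned next to the fine-tuned weights. A sketch (the directory name is hypothetical):

```python
# A sketch: save_pretrained writes vocab.txt plus the tokenizer config, so
# the exact same tokenizer reloads from the fine-tuned model's directory.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
tokenizer.save_pretrained("./my-finetuned-bert")   # hypothetical path

# later, model and tokenizer load from the same place:
tokenizer = BertTokenizer.from_pretrained("./my-finetuned-bert")
```
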
2 votes · 1 answer

GPT2 on Hugging Face (PyTorch transformers): RuntimeError: grad can be implicitly created only for scalar outputs

I am trying to fine-tune gpt2 with a custom dataset of mine. I created a basic example following the documentation from Hugging Face transformers, and I receive the mentioned error. I know what it means: (basically it is calling backward on a non-scalar…
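
A minimal sketch of the usual fix: pass labels so the model returns a scalar loss, and call backward() on that rather than on the logits.

```python
# A sketch: with labels, GPT2LMHeadModel returns a scalar cross-entropy
# loss, which backward() accepts; calling backward() on the logits would
# raise the RuntimeError from the question.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer("my custom training text", return_tensors="pt")
outputs = model(**enc, labels=enc["input_ids"])
outputs.loss.backward()   # loss is a scalar, so this works
```
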
2 votes · 2 answers

Is there a way to get the location of the substring from which a certain token has been produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings: string = tokenizer.decode(...). However, the…
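
Yes, with a fast tokenizer: return_offsets_mapping gives each token's (start, end) character span in the original string. A sketch:

```python
# A sketch: offset_mapping maps every token back to the substring it came
# from; special tokens like [CLS] map to the empty span (0, 0).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
sentence = "Tokenization is fun"
enc = tokenizer(sentence, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(token, repr(sentence[start:end]))
```
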
1 vote · 0 answers

Error when running a Hugging Face model in 4-bit mode in Streamlit using bitsandbytes: quant state is unexpectedly being set to None

I am loading the Hugging Face starchat-beta model in Streamlit and caching it thus:

    @st.cache_resource
    def load_model():
        """Initialize the tokenizer and the AI model."""
        tokenizer =…
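
For reference, a sketch of loading a 4-bit starchat-beta model under st.cache_resource with an explicit BitsAndBytesConfig (the checkpoint id is an assumption based on the question):

```python
# A sketch: cache the quantized model once per Streamlit session; the
# explicit BitsAndBytesConfig keeps the 4-bit quantization state attached
# to the loaded weights.
import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

@st.cache_resource
def load_model():
    """Initialize the tokenizer and the AI model."""
    quant = BitsAndBytesConfig(load_in_4bit=True,
                               bnb_4bit_compute_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
    model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta",
                                                 quantization_config=quant,
                                                 device_map="auto")
    return tokenizer, model
```
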
1 vote · 0 answers

LayoutLMv3: issue with postprocess method not returning data beyond 512 tokens despite complete inference

I am facing an issue with my post-processing method. I have a pipeline that involves preprocessing, inference, and post-processing steps. During the preprocessing step, I tokenize the input data and handle token overflow for sequences greater than…
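
A sketch of the chunking side of such a pipeline, assuming a fast LayoutLMv3 processor; the post-processing step then has to aggregate all returned chunks instead of only the first one:

```python
# A sketch: return_overflowing_tokens yields one row per 512-token chunk,
# and overflow_to_sample_mapping says which original sample each row
# belongs to, so post-processing can merge predictions past 512 tokens.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base",
                                          apply_ocr=False)
encoding = processor(image, words, boxes=boxes,   # assumed OCR inputs
                     truncation=True,
                     max_length=512,
                     stride=128,
                     return_overflowing_tokens=True,
                     padding="max_length",
                     return_tensors="pt")
print(encoding["overflow_to_sample_mapping"])
```
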
1 vote · 0 answers

Failing to import transformers.models.t5.modeling_flax_t5

The following error occurs:

    RuntimeError: Failed to import transformers.models.t5.modeling_flax_t5
    because of the following error (look up to see its traceback):
    module 'jax.numpy' has no attribute 'DeviceArray'

when I try to run this command on…
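
This usually signals a JAX/transformers version mismatch: jax.numpy.DeviceArray was removed in JAX 0.4 (replaced by jax.Array), while older transformers releases still reference it. A quick check, with the usual fixes noted as comments:

```python
# A sketch: print both versions; if jax >= 0.4 while transformers is old,
# either upgrade transformers or pin the older JAX, e.g.
#   pip install "jax<0.4" "jaxlib<0.4"
import jax
import transformers

print("jax:", jax.__version__)
print("transformers:", transformers.__version__)
```
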
1 vote · 0 answers

Why can T5 only generate sentences of length 20? How can I generate longer sentences?

    from datasets import load_dataset
    books = load_dataset('higashi1/mymulti30k', "en-de")

    from transformers import AutoTokenizer
    #checkpoint = "./logs/"
    checkpoint = "t5-base"
    tokenizer =…
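
The likely cause is generate()'s historical default of max_length=20; passing max_new_tokens (or a larger max_length) lifts the cap. A sketch:

```python
# A sketch: max_new_tokens overrides the default 20-token generation cap.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
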
1 vote · 1 answer

Huggingface: How do I find the max length of a model?

Given a transformer model on Hugging Face, how do I find the maximum input sequence length? For example, here I want to truncate to the max_length of the model:

    tokenizer(examples["text"], padding="max_length", truncation=True)

How do I find the…
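
The two usual places to look, sketched below: the tokenizer's model_max_length and the config's max_position_embeddings.

```python
# A sketch: both values are 512 for bert-base-uncased; note that some
# tokenizers report a huge sentinel model_max_length when no limit was
# recorded, in which case the config is the better source.
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = AutoConfig.from_pretrained("bert-base-uncased")

print(tokenizer.model_max_length)       # 512
print(config.max_position_embeddings)   # 512
```
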
1 vote · 1 answer

Why does the LayoutLM installation fail?

I want to install LayoutLM in Google Colaboratory. First, I cloned LayoutLM from this GitHub repository: https://github.com/microsoft/unilm.git. After that, I install LayoutLM by running its setup.py file with this code…
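
For reference, a sketch of that install flow as Colab notebook cells (the layoutlm subfolder is an assumption; the repo layout has changed over time):

```python
# Colab notebook cells ("!" runs shell commands); paths are assumptions.
!git clone https://github.com/microsoft/unilm.git
%cd unilm/layoutlm
!pip install .
```
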
1 vote · 1 answer

Running LLM on a local server

I am new to LLMs. I need to run an LLM on a local server and download different models to experiment with. I am trying to follow this guide from Hugging Face: https://huggingface.co/docs/transformers/installation#offline-mode To begin with, I…
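
A sketch of the offline flow that guide describes: fetch the model once, then load it with the offline flag set so nothing touches the network.

```python
# A sketch: HF_HUB_OFFLINE forces transformers to use only local files;
# the directory below is a hypothetical save_pretrained/download target.
import os
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/my-llm")
model = AutoModelForCausalLM.from_pretrained("./models/my-llm")
```
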
1 vote · 1 answer

How to create huggingface tokenizer from a "char_to_idx" dict?

Given a dictionary char_to_idx, how can one create a tokenizer such that the ids of the tokens are guaranteed to be the same as in char_to_idx?

    char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
    tokenizer =…
Yorai Levi · 473 reputation · 5 silver · 17 bronze
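
One way to do it (a sketch, not the only option): a WordLevel model built directly from char_to_idx, plus a pre-tokenizer that isolates every character, keeps the ids exactly as given.

```python
# A sketch: WordLevel uses char_to_idx verbatim as its vocab; the Split
# pre-tokenizer turns every character into its own "word". Add an unk
# token to the vocab if unseen characters can appear.
from tokenizers import Regex, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split

char_to_idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
tokenizer = Tokenizer(WordLevel(vocab=char_to_idx, unk_token=None))
tokenizer.pre_tokenizer = Split(Regex("."), behavior="isolated")

print(tokenizer.encode("abcd").ids)   # [0, 1, 2, 3]
```
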
1 vote · 0 answers

Decoded text of huggingface Unigram tokenizer has extra spaces

decoded should be equal to text, but:

    import tokenizers

    text = "Hello World!"
    tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
    tokenizer.train_from_iterator(text)
    encoded = tokenizer.encode(text)
    decoded =…
Yorai Levi · 473 reputation · 5 silver · 17 bronze
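
A sketch of one likely fix: without a decoder, decode() joins tokens with spaces, so pairing a Metaspace pre-tokenizer with a Metaspace decoder restores the original spacing. Note also that train_from_iterator expects an iterable of strings, and a bare string is iterated character by character.

```python
# A sketch: Metaspace marks word boundaries during pre-tokenization and
# the matching decoder turns those marks back into single spaces.
import tokenizers

text = "Hello World!"
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Metaspace()
tokenizer.decoder = tokenizers.decoders.Metaspace()

trainer = tokenizers.trainers.UnigramTrainer()
tokenizer.train_from_iterator([text], trainer=trainer)  # note the list

encoded = tokenizer.encode(text)
print(tokenizer.decode(encoded.ids))  # expected: "Hello World!"
```
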