Questions tagged [language-model]

266 questions
1
vote
0 answers

Using Theano to implement maximum likelihood learning in a neural probabilistic language model in Python

I'm trying to implement maximum likelihood learning for a neural probabilistic language model in Python, starting from the code of a log-bilinear model: https://github.com/wenjieguan/Log-bilinear-language-models/blob/master/lbl.py I used the grad function in Theano to…
kidstar
0
votes
0 answers

Why do unmasked tokens of a sequence change when passed through a language model?

Why does passing a sequence of tokens, say ["A", "B", "C", "D"], through a masked language model without any masking not result in the same sequence being output when you select the highest-probability tokens from the model's output logits, i.e.,…
Anshul
0
votes
0 answers

How to vectorize text data in a pandas.DataFrame and then one_hot encode it "inside" the model

I am trying to implement a sequence model (trained to predict the next word) built on one-hot encoded vector sequences. My custom one-hot encoder works well, but as an exercise I want to do everything with TensorFlow (inspired by Deep Learning with Python,…
x3mEr
0
votes
0 answers

With a HuggingFace trainer, how do I show the training loss versus the eval data set?

I'm running: #original training script trainer = transformers.Trainer( model=model, train_dataset=train_dataset, eval_dataset=test_dataset, #turn on the eval dataset for comparisons args=transformers.TrainingArguments( …
0
votes
0 answers

How to train a KenLM language model for Nvidia's QuartzNet?

I am trying to train a speech-to-text model for the Armenian language using the Nvidia NeMo toolkit. After training the acoustic model, I used the provided NeMo/scripts/asr_language_modeling/ngram_lm/train_kenlm.py file to train the language…
arm
0
votes
0 answers

Python-based way to extract text from a scientific/academic paper for a language model

I am looking for a method to extract only the core text of a scientific paper. The paper is structured in paragraphs, and I only want to keep the text, without any email addresses, websites, tables, or pictures. My purpose is to create a clean txt file…
0
votes
1 answer

How to get the embedding of any vocabulary token in GPT?

I have a GPT model model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device) When I send my batch to it I can get the logits and the hidden states: out = model(batch["input_ids"].to(device), output_hidden_states=True,…
0
votes
1 answer

How to get the vector embedding of a token in GPT?

I have a GPT model model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").to(device) When I send my batch to it I can get the logits and the hidden states: out = model(batch["input_ids"].to(device), output_hidden_states=True,…
0
votes
0 answers

How to use a biomedical model from Huggingface to get text embeddings?

I have biomedical text that I'm trying to get the embeddings for using a biomedical transformer: my_text = ["Chocolate has a history of human consumption tracing back to 400 AD and is rich in polyphenols such as catechins, anthocyanidins, and pro…
0
votes
0 answers

How to train a language model in Huggingface with a custom loss?

I'm following Huggingface's tutorial on training a causal language model. I want to modify it such that instead of just predicting the next token, the model is also predicting a vector after some tokens corresponding to the sentiment. So for…
0
votes
1 answer

Error while installing lmql[hf] using pip: "No matching distribution found for lmql[hf]"

I am trying to install lmql[hf] using the pip package manager in order to set up a local LMQL playground. Following the documentation, I ran the command pip install lmql[hf]. However, I encountered the following error: ERROR: Ignored the following…
0
votes
1 answer

ArrowInvalid: Column 4 named input_ids expected length 1000 but got length 328

# Formatting block_size = 128 # or any number suitable to your context def group_texts(examples): # Concatenate all 'input_ids' concatenated_examples = sum(examples["input_ids"], []) total_length = len(concatenated_examples) #…
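The excerpt above is cut off, so the following is only a minimal sketch of the common concatenate-and-chunk pattern it appears to start, not the asker's actual function. The name `group_into_blocks` and the tiny `block_size` are hypothetical choices for illustration. It shows that chunking returns a different number of rows than it receives, which is one plausible source of a length mismatch like the one in the title if the original columns are kept alongside the new `input_ids`.

```python
def group_into_blocks(examples, block_size=4):
    """Concatenate all token id lists, then split them into fixed-size blocks.

    `examples` is a dict with an "input_ids" key holding a list of lists,
    the shape a batched mapping function typically receives (assumption).
    """
    # Flatten every sequence in the batch into one long list of ids.
    concatenated = sum(examples["input_ids"], [])
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[i:i + block_size]
        for i in range(0, total_length, block_size)
    ]
    return {"input_ids": blocks}


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
grouped = group_into_blocks(batch, block_size=4)
# 3 input rows become 2 output rows, so row counts no longer match.
print(len(batch["input_ids"]), len(grouped["input_ids"]))
```

If the mismatch does come from this, dropping the original columns when applying the function is the usual remedy, but that depends on code not shown in the excerpt.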
0
votes
0 answers

How do I do vector embedding of words using Ruby, without making calls to a third party API?

How do I make vector embeddings of words using Ruby, without making calls to a third party API? Just want to do it locally for speed and cost. I can't find any good examples in Ruby.
Some Guy
0
votes
0 answers

How to compute a simple maximum likelihood LM with SRILM

I want to build a simple maximum likelihood (i.e. p(w|w_history) = c(w_history, w)/c(w_history), nothing else) language model without any tricks like smoothing. I am using a small corpus on purpose, to check that the computed numbers match with…
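For checking numbers by hand, the formula above can be computed directly from counts. This is a minimal sketch in plain Python, not SRILM itself: the function name `mle_ngram_probs` and the toy corpus are hypothetical, and it ignores sentence-boundary tokens, which a real toolkit may add.

```python
from collections import Counter


def mle_ngram_probs(tokens, n=2):
    """Unsmoothed estimates p(w | history) = c(history, w) / c(history)."""
    ngram_counts = Counter()
    history_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        # Count the history only where it is followed by a word,
        # so the probabilities for each history sum to 1.
        history_counts[ngram[:-1]] += 1
    return {
        ngram: count / history_counts[ngram[:-1]]
        for ngram, count in ngram_counts.items()
    }


corpus = "a b a b a c".split()
probs = mle_ngram_probs(corpus, n=2)
for ngram, p in sorted(probs.items()):
    print(ngram, round(p, 4))
```

In this toy corpus "a" is followed by "b" twice and by "c" once, so p(b|a) = 2/3 and p(c|a) = 1/3, while "b" is always followed by "a".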
peer
0
votes
1 answer

How to denoise text using T5?

I'm trying to denoise text using a T5 model following the Huggingface doc: from transformers import T5Tokenizer, T5ForConditionalGeneration tokenizer = T5Tokenizer.from_pretrained("t5-small") model =…
Penguin