
I'm working with the NLTK module to build a trigram language model and I've run into an issue. It only starts to occur with trigrams; bigram and unigram models seem to be safe from it.

I've been using padded_everygram_pipeline and I suspect it might be the culprit:

# trigram model

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# read in the corpus
with open("the_medea.txt") as f:
    medea = f.read()

# split into sentences, then tokenize each sentence into words
medea_sents = nltk.sent_tokenize(medea)
medea_sents_tokenized = [nltk.word_tokenize(s) for s in medea_sents]

# pad each sentence and build the training everygrams and vocabulary
medea_corpus, vocab = padded_everygram_pipeline(3, medea_sents_tokenized)

# fit a maximum-likelihood trigram model
lm_medea = MLE(3)
lm_medea.fit(medea_corpus, vocab)

for word in lm_medea.generate(50):
    print(word, end=" ")

#**Output**
#"criticisms one may make some answer . </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>

I had to add the output to the bottom of the code as a comment, but it's just a sea of end-of-sentence markers. Any idea why this might be happening?
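For context, this is my understanding of what the pipeline's padding looks like for order 3. It's a hand-rolled sketch mimicking the padding step, not the actual NLTK call:

```python
def pad_both_ends_sketch(sent, n):
    # NLTK-style padding: n-1 boundary tokens on each side for an order-n model
    return ["<s>"] * (n - 1) + list(sent) + ["</s>"] * (n - 1)

# pad_both_ends_sketch(["hello", "world"], 3)
# → ['<s>', '<s>', 'hello', 'world', '</s>', '</s>']
```

With two `</s>` pads per sentence, every `</s>` in the training data is followed by another `</s>`, so I suspect that once generation emits one `</s>` it can never escape that context.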

To clarify as well: this question is very similar to NLTK MLE model clarification trigrams and greater. I marked that question as not solving mine because its answer is outdated and no longer resolves the problem; following it now raises a TypeError instead of generating unique trigram-based language.
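For now I'm working around the symptom with a small helper of my own that cuts the generated sequence at the first end-of-sentence token (`generate` itself is the standard nltk.lm method; the helper is just mine):

```python
def truncate_at_eos(tokens, eos="</s>"):
    # keep tokens up to, but not including, the first end-of-sentence pad
    out = []
    for tok in tokens:
        if tok == eos:
            break
        out.append(tok)
    return out

# usage: print(" ".join(truncate_at_eos(lm_medea.generate(50))))
```

That hides the flood of `</s>` tokens, but I'd still like to understand why the model produces them in the first place.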

  • Does this answer your question? [NLTK MLE model clarification trigrams and greater](https://stackoverflow.com/questions/60295058/nltk-mle-model-clarification-trigrams-and-greater) – dx2-66 Sep 15 '22 at 10:42
  • share the text file as well – atinjanki Sep 15 '22 at 14:05
