I'm working with the NLTK module to build a trigram language model and I've run into an issue. It only shows up with trigrams; bigram and unigram models seem to be unaffected.
I've been using padded_everygram_pipeline and I suspect it might be the culprit:
# trigram model
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
with open("the_medea.txt") as f:
    medea = f.read()
medea_sents = nltk.sent_tokenize(medea)
medea_sents_tokenized = [nltk.word_tokenize(s) for s in medea_sents]
medea_corpus, vocab = padded_everygram_pipeline(3, medea_sents_tokenized)
lm_medea = MLE(3)
lm_medea.fit(medea_corpus, vocab)
for word in lm_medea.generate(50):
    print(word, end=" ")
# Output:
#"criticisms one may make some answer . </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
I had to paste the output at the bottom of the code block, but as you can see it's just a sea of end-of-sentence (</s>) tokens. Any idea why this might be happening?
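In case it's useful for diagnosing this, here's a minimal sketch of how the pipeline's padding can be inspected (it assumes the same medea_sents_tokenized as above, and re-creates the pipeline because padded_everygram_pipeline returns lazy generators that fit has already consumed):

from nltk.lm.preprocessing import padded_everygram_pipeline

# Re-create the pipeline and peek at the everygrams for the first tokenized sentence
check_grams, check_vocab = padded_everygram_pipeline(3, medea_sents_tokenized)
first_sentence_grams = list(next(check_grams))
print(first_sentence_grams[:5])   # shows the <s> padding added at the start
print(first_sentence_grams[-5:])  # shows the </s> padding added at the end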
To clarify as well, this question is very similar to NLTK MLE model clarification trigrams and greater. The reason I don't consider that question to resolve this one is that its answer is outdated and no longer fixes the problem in the model: running it now raises a TypeError instead of generating unique trigram-based text.