If you look at the code for fitting the language model, you can see that at its core, what `fit()` does is update the counts based on the documents in `train_data`:

self.counts.update(self.vocab.lookup(sent) for sent in text)
However, notice that it updates those counts one sentence at a time. Each sentence is completely independent of the others: the model doesn't know what came before a sentence or what comes after it. Also, remember that you're training a trigram model, so the last two tokens of every sentence are `('</s>', '</s>')`. The model therefore learns that `'</s>'` is followed by `'</s>'` with very high probability, but it never learns that `'</s>'` can sometimes be followed by `'<s>'`.
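You can see this concretely by training the usual way and then inspecting the counts (a quick illustration on a toy corpus; the `model.counts` indexing shown is standard `nltk.lm` usage):

>>> from nltk.lm import MLE
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, padded_sents = padded_everygram_pipeline(3, [['a', 'b'], ['c', 'd']])
>>> model = MLE(3)
>>> model.fit(train_data, padded_sents)
>>> model.counts[['</s>']]['</s>']  # '</s>' followed by '</s>': seen once per sentence
2
>>> model.counts[['</s>', '</s>']]['<s>']  # a sentence boundary: never observed
0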
So the easiest solution to your problem is just to manually start a new sentence (i.e. call `generate()` again) every time you see `'</s>'`. Something along these lines would work (a sketch only; `generate_sentences` and its parameters are names of my own, not NLTK API):
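def generate_sentences(model, num_sents, max_words=20, random_seed=42):
    """Call model.generate() afresh for each sentence, stopping at '</s>'."""
    sentences = []
    for i in range(num_sents):
        sent = []
        # seed with sentence-start padding so each call begins a fresh sentence
        for word in model.generate(max_words, text_seed=['<s>', '<s>'],
                                   random_seed=random_seed + i):
            if word == '</s>':  # manually start over at the end-of-sentence token
                break
            if word != '<s>':
                sent.append(word)
        sentences.append(' '.join(sent))
    return sentences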
But let's say you don't want to do that, and want the model to generate multiple sentences in one go.
From the docstring for `padded_everygram_pipeline`:
Creates two iterators:
- sentences padded and turned into sequences of `nltk.util.everygrams`
- sentences padded as above and chained together for a flat stream of words
So unlike `train_data`, `padded_sents` contains all of your sentences chained together as a single flat sequence:
>>> from nltk.lm import MLE
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> from nltk.util import everygrams
>>> n = 3  # trigram model
>>> tokenized_text = [['this', 'is', 'sentence', 'one'],
...                   ['this', 'is', 'sentence', 'two']]
>>> train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
>>> padded_sents = list(padded_sents)  # padded_sents is a generator and can only be gone through once
>>> print(padded_sents)
['<s>', '<s>', 'this', 'is', 'sentence', 'one', '</s>', '</s>', '<s>', '<s>', 'this', 'is', 'sentence', 'two', '</s>', '</s>']
>>> model = MLE(n)
>>> # fit() expects each "sentence" to be a sequence of ngram tuples,
>>> # so treat the whole padded stream as one long sentence of everygrams
>>> model.fit([everygrams(padded_sents, max_len=n)], padded_sents)  # notice that I'm not using train_data
Good news: we now have an example of `'<s>'` following `'</s>'`. Bad news: the only trigrams that span two different sentences are `('</s>', '</s>', '<s>')` and `('</s>', '<s>', '<s>')`. So `generate()` should now produce multiple sentences, but the content of those sentences will still be completely independent.
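You can confirm that this boundary trigram is the only cross-sentence evidence the model has (a quick check on the model fitted above):

>>> model.counts[['</s>', '</s>']]['<s>']  # the one boundary in our two-sentence corpus
1
>>> model.score('<s>', ['</s>', '</s>'])  # so P('<s>' | '</s>', '</s>') = 1
1.0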
If you want the content of the previous sentence to influence the content of the next one, that's where things start to get complicated. Instead of passing your corpus to the model as a series of sentences, you could pass it as a series of paragraphs with multiple sentences each:
tokenized_text = [['this', 'is', 'sentence', 'one', '.', 'this', 'is', 'sentence', 'two', '.'],
                  ['this', 'is', 'a', 'second', 'paragraph', '.']]
That would work, but now `'<s>'` and `'</s>'` don't mean the start and end of a sentence; they mean the start and end of a paragraph, and generated paragraphs will still be independent of each other. You could also scale this up: instead of paragraphs, train on series of paragraphs or even entire books. It mostly depends on what works best for your task.
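For completeness, here is a minimal sketch of that paragraph-level setup, reusing the `tokenized_text` above; it's just the standard `nltk.lm` pipeline with paragraphs passed in where sentences usually go:

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_sents)
# within one generated "paragraph", trigrams can span the '.' between sentences,
# so one sentence's content now influences the next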