
I am learning NLTK and have a question about data preprocessing and the MLE model. Currently I am trying to generate words with the MLE model. The problem is that whenever I pick n >= 3, my model produces words completely fine until it gets to a period ('.'). Afterwards, it only outputs end-of-sentence padding.

This is essentially what I am doing:


from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(MYTEXTINPUT)]

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_sents)
model.generate(20)

# OUTPUT, e.g.:
# blah beep bloop . </s> </s> </s> </s> </s> </s> </s> </s> (continues until 20 words are reached)

I suspect that the answer to my problem lies in the way my n-grams are prepared for the model. So is there a way to format/prepare the data so that, for example, trigrams are generated like ('.', '</s>', '<s>'), so that the model will try to start another sentence and output more words?

Or is there another way to avoid the problem described above?


3 Answers


If you look at the code for fitting the language model, you can see that, at its core, what fit() does is update the counts based on the documents in train_data:

self.counts.update(self.vocab.lookup(sent) for sent in text)

However, notice that it updates those counts one sentence at a time. Each sentence is completely independent of the others. The model doesn't know what came before that sentence nor what comes after. Also, remember that you're training a trigram model, so the last two tokens in every sentence are ('</s>', '</s>'). Therefore, the model learns that '</s>' is followed by '</s>' with a very high probability, but it never learns that '</s>' can sometimes be followed by '<s>'.
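You can verify this quickly in the fitted counts (a sketch, assuming the trigram model fitted in the question):

# In a model trained one sentence at a time, '</s>' is (almost) always
# followed by '</s>', and '<s>' never follows '</s>':
print(model.counts[['</s>']]['</s>'])  # large count
print(model.counts[['</s>']]['<s>'])   # 0 -- the model never saw this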

So the easiest solution to your problem is just to manually start a new sentence (i.e. call generate() again) every time you see '</s>'; see the sketch just below.
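A minimal sketch of that restart loop, assuming the fitted model from the question (the cap of 20 words is arbitrary):

# Keep generating until we have enough words, restarting after each '</s>'.
words = []
while len(words) < 20:
    for token in model.generate(20):
        if token == '</s>':
            break            # sentence finished: restart with a fresh call
        if token != '<s>':
            words.append(token)
print(' '.join(words))

But let's say you don't want to do that and want the model to generate multiple sentences in one go.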

From the docstring for padded_everygram_pipeline:

Creates two iterators:
- sentences padded and turned into sequences of `nltk.util.everygrams`
- sentences padded as above and chained together for a flat stream of words

So unlike train_data, padded_sents contains all of your sentences chained together as a single flat sequence:

>>> tokenized_text = [['this', 'is', 'sentence', 'one'],
...                   ['this', 'is', 'sentence', 'two']]
>>> train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
>>> padded_sents = list(padded_sents)  # padded_sents is a generator and can only be consumed once
>>> print(padded_sents)
['<s>', '<s>', 'this', 'is', 'sentence', 'one', '</s>', '</s>', '<s>', '<s>', 'this', 'is', 'sentence', 'two', '</s>', '</s>']
>>> model = MLE(n)
>>> model.fit(padded_sents, padded_sents)  # note: not using train_data

Good news: we now have an example of '<s>' following '</s>'. Bad news: the only trigrams that span two different sentences are ('</s>', '</s>', '<s>') and ('</s>', '<s>', '<s>'). So generate() should now produce multiple sentences, but the content of those sentences will still be completely independent.
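If the fit above succeeds (it did on older NLTK versions; see the last answer below), you can check that the spanning trigram was actually learned:

# After fitting on the flattened stream, '<s>' can now follow '</s> </s>':
print(model.counts[['</s>', '</s>']]['<s>'])  # > 0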

If you want the content of the previous sentence to influence the content of the next one, that's where things start to get complicated. Instead of passing your corpus to the model as a series of sentences, you could pass it as a series of paragraphs with multiple sentences each:

tokenized_text = [['this', 'is', 'sentence', 'one', '.', 'this', 'is', 'sentence', 'two', '.'],
                  ['this', 'is', 'a', 'second', 'paragraph', '.']
                  ]

That would work, but now '<s>' and '</s>' don't mean the start and end of a sentence; they mean the start and end of a paragraph. And generated paragraphs will still be independent of each other. You could expand this further so that, instead of paragraphs, you generate series of paragraphs or entire books. It depends on what works best for your task.
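For example, paragraph-level input could be built like this (a sketch that assumes MYTEXTINPUT separates paragraphs with blank lines, which may not match your data):

# Tokenize each paragraph into one flat token list, keeping the periods.
tokenized_text = [
    [word.lower()
     for sent in sent_tokenize(paragraph)
     for word in word_tokenize(sent)]
    for paragraph in MYTEXTINPUT.split('\n\n')
]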

acattle

The question is: when generating from a language model, when should generation stop?

A simple idiom for generating is to stop as soon as the model emits the end-of-sentence token. In code (adapted from a tutorial snippet), that can be achieved with:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)
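Usage might look like this (the output depends on your training text, so none is shown; `model` is assumed to be the trigram MLE model fitted earlier):

print(generate_sent(model, num_words=20, random_seed=42))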

But there's actually a similar generate() function already in NLTK, from https://github.com/nltk/nltk/blob/develop/nltk/lm/api.py#L182

def generate(self, num_words=1, text_seed=None, random_seed=None):
    """Generate words from the model.
    :param int num_words: How many words to generate. By default 1.
    :param text_seed: Generation can be conditioned on preceding context.
    :param random_seed: A random seed or an instance of `random.Random`. If provided,
    makes the random sampling part of generation reproducible.
    :return: One (str) word or a list of words generated from model.
    Examples:
    >>> from nltk.lm import MLE
    >>> lm = MLE(2)
    >>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
    >>> lm.fit([[("a",), ("b",), ("c",)]])
    >>> lm.generate(random_seed=3)
    'a'
    >>> lm.generate(text_seed=['a'])
    'b'
    """
    text_seed = [] if text_seed is None else list(text_seed)
    random_generator = _random_generator(random_seed)
    # This is the base recursion case.
    if num_words == 1:
        context = (
            text_seed[-self.order + 1 :]
            if len(text_seed) >= self.order
            else text_seed
        )
        samples = self.context_counts(self.vocab.lookup(context))
        while context and not samples:
            context = context[1:] if len(context) > 1 else []
            samples = self.context_counts(self.vocab.lookup(context))
        # Sorting samples achieves two things:
        # - reproducible randomness when sampling
        # - turns Mapping into Sequence which `_weighted_choice` expects
        samples = sorted(samples)
        return _weighted_choice(
            samples,
            tuple(self.score(w, context) for w in samples),
            random_generator,
        )
    # We build up text one word at a time using the preceding context.
    generated = []
    for _ in range(num_words):
        generated.append(
            self.generate(
                num_words=1,
                text_seed=text_seed + generated,
                random_seed=random_generator,
            )
        )
    return generated

More details on the implementation are in https://github.com/nltk/nltk/pull/2300 (note: see the hidden comments in the code review).

alvas

It's been about two and a half years since this was answered, so I think the model has changed.

If you attempt to fit the model now using only padded_sents, it raises a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/Documents/GitHub Cell 14 in <cell line: 25>()
     21 #print(padded_sents)
     23 model = MLE(n)
---> 25 model.fit(padded_sents, padded_sents)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/nltk/lm/api.py:109, in LanguageModel.fit(self, text, vocabulary_text)
    105         raise ValueError(
    106             "Cannot fit without a vocabulary or text to create it from."
    107         )
    108     self.vocab.update(vocabulary_text)
--> 109 self.counts.update(self.vocab.lookup(sent) for sent in text)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/nltk/lm/counter.py:118, in NgramCounter.update(self, ngram_text)
    116 for ngram in sent:
    117     if not isinstance(ngram, tuple):
--> 118         raise TypeError(
    119             "Ngram <{}> isn't a tuple, " "but {}".format(ngram, type(ngram))
    120         )
    122     ngram_order = len(ngram)
    123     if ngram_order == 1:

TypeError: Ngram <<> isn't a tuple, but <class 'str'>

So the answer from @acattle looks like it might not work anymore.
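One way to make the flattened-stream idea work on newer NLTK versions might be to wrap the stream in nltk.util.everygrams, so that fit() receives ngram tuples rather than bare strings. This is a sketch under that assumption; I haven't tested it against every version:

from nltk.util import everygrams

padded_tokens = list(padded_sents)  # materialize the generator once
model = MLE(n)
# fit() expects an iterable of "sentences", each a sequence of ngram tuples,
# so wrap the single flattened stream in a one-element list.
model.fit([everygrams(padded_tokens, max_len=n)], vocabulary_text=padded_tokens)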