
Let's say we have an RNN model, trained on a corpus, that outputs the probability of a word given its context (or no context). We can chain the probability of each word in a sequence to get the overall probability of the sentence itself. But, because we are chaining, the probability (or likelihood) of a sentence goes down as its length increases. The same is true even if we use log probabilities.
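For concreteness, this is the kind of chaining I mean; a minimal sketch in Python, where `log_prob(word, context)` is a hypothetical wrapper around whatever the trained RNN actually exposes:

    def sentence_log_prob(sentence, log_prob):
        """Chain per-word log probabilities to score a whole sentence.

        `log_prob(word, context)` is assumed to return log P(word | context)
        from the trained RNN; it is a placeholder, not a real API.
        """
        words = sentence.split()
        total = 0.0
        for i, word in enumerate(words):
            total += log_prob(word, words[:i])  # condition on all preceding words
        return total  # grows more negative as the sentence gets longer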

Is there any way we could normalize these probabilities? This is an interesting subproblem that I am facing while building a language model. I have a corpus of 9 million sentences whose lengths vary from 2 to 30 words. All of the sentences are valid ones, and I am using them as the corpus to train the LM.

Now, I am taking a subset of the data and making changes to it, like shuffling the words, cutting the sentence in half, prepending or appending a random word, and so on. This creates a "fake sentence" that need not be valid. What I would like to do is derive a threshold of some sort over the likelihoods of all the valid sentences, so that when I use the RNN to compute the probability of a fake sentence, its score falls noticeably below that threshold.
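Roughly, the perturbations look like this (a sketch only; the exact operations and the `vocab` list I sample random words from are placeholders):

    import random

    def make_fake(sentence, vocab, rng=random):
        """Turn a valid sentence into a (probably) invalid one."""
        words = sentence.split()
        op = rng.choice(["shuffle", "cut", "prepend", "append"])
        if op == "shuffle":
            rng.shuffle(words)                        # scramble word order
        elif op == "cut":
            words = words[: max(1, len(words) // 2)]  # keep only the first half
        elif op == "prepend":
            words = [rng.choice(vocab)] + words       # random word at the front
        else:
            words = words + [rng.choice(vocab)]       # random word at the end
        return " ".join(words)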

tldr; sentences like

"the cat sat on the red mat"
"the cat sat on a mat"
"a cat sat on the red mat with brown coffee stains"

should all have a comparable probability/score/metric while sentences like

"cat cat mat on the brown red sat is"
"not mat in door on cat"

have a lower score.

Sanjay Krishna

1 Answer


You can introduce a special word, END-OF-SENTENCE, and predict its probability along with the rest of the words. That way you will be able to model the distribution over sentence lengths correctly. There is a good example in exercise 4 of the NLP book by Jurafsky.
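As a sketch of what that looks like in practice (the token string `</s>` is just a placeholder; any symbol that never occurs in the corpus works):

    EOS = "</s>"  # special end-of-sentence token

    def with_eos(sentence):
        """Append the EOS token so the model learns where sentences end."""
        return sentence.split() + [EOS]

    # Every training sentence gets the explicit end marker ...
    corpus = ["the cat sat on the red mat", "the cat sat on a mat"]
    training_sequences = [with_eos(s) for s in corpus]

    # ... and at scoring time the chained sum includes log P(EOS | sentence),
    # which penalises sentences that stop in unlikely places ("... with").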

Indeed, the sentence "A cat sat on the red mat with brown coffee stains END" is more probable than "A cat sat on the red mat with END", simply because sentences rarely end with "with". And if your RNN is good enough, it will reflect this.

If you still want to normalize sentence probabilities, you can compute the mean log probability per word, or equivalently the perplexity (its exponentiated negative), like in this question where the concept is shown with a simple 1-gram model.
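As a rough sketch, length normalisation amounts to dividing the chained log probability by the number of words; perplexity is then the exponential of the negated mean (assuming the log probability already includes the END term):

    import math

    def mean_log_prob(sentence_log_prob, num_words):
        """Length-normalised score: average log probability per word."""
        return sentence_log_prob / num_words

    def perplexity(sentence_log_prob, num_words):
        """Perplexity = exp of the negative mean log probability per word."""
        return math.exp(-sentence_log_prob / num_words)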

David Dale
  • Thanks for some great pointers. Although, I definitely think I need to normalize, because the word distribution within the corpus I am working on is crazy. While the most popular words occurred a few thousand times, the least popular words occurred just once, and sentences containing these infrequent words are also valid. Coupled with the varying lengths, the probabilities quickly drop to 0, which defeats the purpose of the LM. Honestly, I don't think normalizing with respect to length alone would help in achieving a threshold, but let me verify that with perplexity. – Sanjay Krishna Mar 02 '18 at 22:45