Let's say we have an RNN model, trained on a corpus, that outputs the probability of a word given its context (or no context). We can chain the per-word probabilities of a sequence to get the overall probability of the sentence itself. But because we are chaining, the probability (or likelihood) of the sentence goes down as its length increases. The same is true even if we use log probabilities.
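For concreteness, here is a minimal sketch of the chaining I mean. `model.prob(word, context)` stands in for the trained RNN and is a hypothetical interface, not a real library call:

```python
import math

def sentence_log_prob(model, tokens):
    """Chain rule: log P(w1..wn) = sum_i log P(wi | w1..w(i-1))."""
    total = 0.0
    for i, word in enumerate(tokens):
        # model.prob is a hypothetical wrapper around the RNN's softmax output
        total += math.log(model.prob(word, tokens[:i]))
    return total
```

Every word contributes a negative term, so the raw score almost always drops as the sentence gets longer, even when each word is perfectly plausible.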
Is there any way we could normalize these probabilities? This is an interesting subproblem I am facing while building a language model. I have a corpus of 9 million sentences whose lengths vary from 2 to 30 words. All of the sentences are valid, and I am using them to train the LM.
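The only candidate I can think of is dividing the total log probability by the sentence length, i.e., the per-word average log likelihood (equivalently, perplexity), but I am not sure that is the right answer. A sketch, reusing the hypothetical `sentence_log_prob` above:

```python
def per_word_log_prob(model, tokens):
    """Length-normalized score: total log probability divided by token count."""
    return sentence_log_prob(model, tokens) / len(tokens)

# Equivalent view: perplexity = exp(-per_word_log_prob(model, tokens)),
# which is also independent of sentence length in the same way.
```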
Now, I am taking a subset of the data and making changes to it, like shuffling the words, cutting the sentence in half, prepending or appending a random word, and so on (see the sketch below). This creates a "fake sentence" that need not be valid. What I would like to do is find a threshold of some sort over the likelihood of all the valid sentences, so that when I use the RNN to compute the probability of a fake sentence, it comes out well below (or clearly separated from) that threshold.
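Here is roughly how I generate the fake sentences; `vocab` is a list of words to sample the random insertions from:

```python
import random

def make_fake(tokens, vocab):
    """Corrupt a valid sentence via one of the perturbations described above."""
    tokens = list(tokens)
    op = random.choice(["shuffle", "halve", "prepend", "append"])
    if op == "shuffle":
        random.shuffle(tokens)              # scramble word order
    elif op == "halve":
        tokens = tokens[: max(1, len(tokens) // 2)]  # cut the sentence in half
    elif op == "prepend":
        tokens.insert(0, random.choice(vocab))       # random word at the front
    else:
        tokens.append(random.choice(vocab))          # random word at the end
    return tokens
```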
tl;dr: sentences like
"the cat sat on the red mat"
"the cat sat on a mat"
"a cat sat on the red mat with brown coffee stains"
should all have a comparable probability/score/metric, while sentences like
"cat cat mat on the brown red sat is"
"not mat in door on cat"
should have a clearly lower score.
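Concretely, the check I have in mind looks something like this (a sketch assuming the helpers above; `held_out_valid` is a hypothetical list of valid tokenized sentences, and the 5th percentile is just a guess at a cutoff):

```python
import numpy as np

# Score a held-out set of known-valid sentences with the normalized metric,
# then take a low percentile of those scores as the acceptance threshold.
valid_scores = [per_word_log_prob(model, s) for s in held_out_valid]
threshold = np.percentile(valid_scores, 5)  # assumption: 5th percentile as cutoff

def looks_valid(tokens):
    """Accept a sentence if its normalized score clears the valid-sentence threshold."""
    return per_word_log_prob(model, tokens) >= threshold
```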