Problem
I have one word and some restrictions on what the second word might be (for example "I _o__"). I want a list of candidate words such as "rode", "love", and "most", along with how common each one is after "I".
More precisely, I want a list of two-tuples (nextword, probability), where nextword is a word that matches a regex and probability is the chance that nextword follows the first word, given by (number of times nextword is seen directly after the first word in a corpus of text)/(number of times the first word appears).
Like this:
[(nextword, follow_probability("I", nextword)) for nextword in findwords('.o..')]
My approach is to first generate a list of possible words that match the regex, and then look up the probability of each. The first part is easy, but I don't know how to do the second part. Ideally I would have a function that takes the two words as arguments and returns the probability that the second follows the first.
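To make the two-step approach above concrete, here is a minimal sketch using only the standard library. It assumes the corpus is already tokenized into a list of words. The toy corpus and the helper build_bigram_counts are hypothetical, and follow_probability and findwords take extra arguments for the counts and vocabulary, which is an assumption on my part rather than a fixed design.

```python
import re
from collections import Counter, defaultdict


def build_bigram_counts(tokens):
    """Count each word, and count which words directly follow it."""
    first_counts = Counter(tokens)
    follower_counts = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        follower_counts[first][second] += 1
    return first_counts, follower_counts


def findwords(pattern, vocabulary):
    """Return the vocabulary words that match the regex in full."""
    compiled = re.compile(pattern)
    return [word for word in sorted(vocabulary) if compiled.fullmatch(word)]


def follow_probability(first, nextword, first_counts, follower_counts):
    """count(first followed by nextword) / count(first), or 0.0 if unseen."""
    total = first_counts.get(first, 0)
    if total == 0:
        return 0.0
    followers = follower_counts.get(first, Counter())
    return followers[nextword] / total


# Hypothetical toy corpus; a real corpus would be tokenized text.
tokens = "I love tea . I love most tea . I rode home . you love tea".split()
first_counts, follower_counts = build_bigram_counts(tokens)

results = [
    (nextword, follow_probability("I", nextword, first_counts, follower_counts))
    for nextword in findwords(".o..", set(tokens))
]
print(results)
```

In this sketch the denominator is the total count of the first word, as in the definition above, so the probabilities for a word that sometimes ends the corpus will sum to slightly less than 1.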
What I Have Tried
- Using the markovify library to build a chain with a state size of 1 and generate sentences with a certain starting word
- Using nltk's BigramCollocationFinder