
Problem

I have one word and some restrictions on what the next word might be (for example, "I _o__"). I want a list of words that fit, such as "rode", "love", and "most", along with how common each one is after "I".

Specifically, I want a list of two-tuples (nextword, probability), where nextword is a word that matches a regex and probability is the chance that nextword follows the first word: (number of times nextword is seen directly after the first word in a corpus of text) / (number of times the first word appears).

Like this:

[(nextword, follow_probability("I", nextword)) for nextword in findwords('.o..')]

My approach to this is to first generate a list of possible words that satisfy the regex, and then look up the probability of each. The first part is easy, but I don't know how to do the second part. Ideally I would be able to have a function taking an argument for each word and returning the probability the second follows the first.
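For reference, the first part could be sketched roughly as follows. This assumes the vocabulary is available as a plain list of words; the `vocabulary` argument and the sample words are placeholders for illustration, not taken from any particular corpus.

```python
import re

def findwords(pattern, vocabulary):
    # Keep only words that match the whole pattern, e.g. '.o..' for "_o__".
    regex = re.compile(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]

# Hypothetical vocabulary, for illustration only.
vocabulary = ['rode', 'love', 'most', 'am', 'think', 'want']
matches = findwords('.o..', vocabulary)
print(matches)
```

Using `fullmatch` rather than `match` keeps longer words such as "loved" from slipping through a four-character pattern.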

What I Have Tried

  • Using the markovify library to build a chain with a state size of 1 and generate sentences that start with a given word
  • Using nltk's BigramCollocationFinder
Riley Martine

1 Answer


Try something like this:

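from collections import Counter, deque
from itertools import islice
from nltk.tokenize import regexp_tokenize
import pandas as pd

def grouper(iterable, length=2):
    # Yield overlapping windows of `length` consecutive items (bigrams by default).
    i = iter(iterable)
    q = deque(islice(i, length), maxlen=length)
    if len(q) < length:
        return  # too few items to form a single window
    yield tuple(q)
    for item in i:
        q.append(item)  # maxlen drops the oldest item automatically
        yield tuple(q)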

def tokenize(text):
    return [word.lower() for word in regexp_tokenize(text, r'\w+')]

def follow_probability(word1, word2, vec):
    subvec = vec.loc[word1]
    try:
        ct = subvec.loc[word2]
    except KeyError:
        ct = 0
    return float(ct) / (subvec.sum() or 1)

text = 'This is some training text this this'
tokens = tokenize(text)
markov = list(grouper(tokens))
vec = pd.Series(Counter(markov))

follow_probability('this', 'is', vec)

Output:

0.5
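
To get the list of (nextword, probability) tuples the question asks for, the same counting idea can be combined with a regex filter. The following is a minimal standard-library sketch, not a drop-in replacement for the code above: the helper names `bigram_counts` and `candidates` are made up for illustration, and the tokenization is a plain `split()` on an already lowercased string. As in `follow_probability`, the denominator is the number of times the first word is followed by something, which can differ by one from its raw count if it is the last token.

```python
import re
from collections import Counter

def bigram_counts(tokens):
    # Count each pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

def candidates(first, pattern, counts):
    # Collect every word seen directly after `first`, with its count.
    followers = {w2: n for (w1, w2), n in counts.items() if w1 == first}
    total = sum(followers.values())
    regex = re.compile(pattern)
    # Keep only followers matching the whole pattern; the list is empty
    # (and no division happens) when `first` never has a follower.
    result = [(w, n / total) for w, n in followers.items() if regex.fullmatch(w)]
    return sorted(result, key=lambda pair: pair[1], reverse=True)

tokens = 'this is some training text this this'.split()
counts = bigram_counts(tokens)
print(candidates('this', 'i.', counts))
```

With the training text above, this prints `[('is', 0.5)]`, which agrees with the `follow_probability` result.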
C. Feenstra
  • That's good, but are there any functions or classes in the `nltk` package that can be used directly instead of writing this yourself? – C.K. Sep 06 '20 at 15:08