
I have a corpus given as a list of strings:

corpus = ["Hello I am Sam", "This is a white desk","I ate cereals", ...]

I want to build a language model on this corpus (preferably using nltk), so that I can get the probability of a word given its preceding context in a sentence. So, my later usage will be to get

P("Sam"| "I am")

in this corpus. I couldn't find what the best way to do this is. How do I train an n-gram model and later query such probabilities?

Thanks!

Cranjis

1 Answer


I would recommend using Markov chains: https://en.wikipedia.org/wiki/Markov_chain

Here is a very trivial example for your reference.

Assume that you are going to analyse 1-grams.

Analysed texts:

monkey eats banana

dog eats bone

unigrams: monkey, eats, banana, dog, bone, BEGIN, END.

Each sentence starts with BEGIN.

Two transitions are possible:

BEGIN->monkey

BEGIN->dog

This means there is a 50% chance that a sentence will begin with monkey.

Now after monkey there is a 100% chance of the transition monkey->eats (because there was no other monkey->* transition in the analysed texts).

Now after eats there is a 50% chance of banana and a 50% chance of bone.

So in general, with this model we can generate the following sentences:

monkey eats banana
monkey eats bone
dog eats bone
dog eats banana

Each of those has a 25% chance of being produced.

Note that bone and banana always transition into END.
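The transition counting described above can be sketched in plain Python (a minimal illustration, not a full language model — the `prob` helper and the BEGIN/END markers are just the names used in this answer):

```python
from collections import Counter, defaultdict

sentences = ["monkey eats banana", "dog eats bone"]

# Count word -> next-word transitions, with BEGIN/END markers.
transitions = defaultdict(Counter)
for s in sentences:
    tokens = ["BEGIN"] + s.split() + ["END"]
    for a, b in zip(tokens, tokens[1:]):
        transitions[a][b] += 1

def prob(word, prev):
    """P(word | prev) as relative transition frequency."""
    total = sum(transitions[prev].values())
    return transitions[prev][word] / total if total else 0.0

print(prob("monkey", "BEGIN"))  # 0.5 -- half the sentences start with monkey
print(prob("eats", "monkey"))   # 1.0 -- monkey is always followed by eats
print(prob("banana", "eats"))   # 0.5 -- eats is followed by banana or bone
```

Generating a sentence is then just a matter of repeatedly sampling the next word from `transitions[current]` until END is reached.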

With bigrams you would just split it into monkey eats -> banana END.

This is just the simplified big picture; I hope it helps.
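Since the question mentions nltk: a minimal sketch using the `nltk.lm` module (assuming a recent NLTK, 3.4+, is installed; the corpus here is the one from the question):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = ["Hello I am Sam", "This is a white desk", "I ate cereals"]
tokenized = [sentence.lower().split() for sentence in corpus]

n = 3  # trigram model, so contexts are up to two words long
train_ngrams, vocab = padded_everygram_pipeline(n, tokenized)

lm = MLE(n)
lm.fit(train_ngrams, vocab)

# P("sam" | "i am") -- score(word, context)
print(lm.score("sam", ["i", "am"]))
```

Swapping `MLE` for `nltk.lm.Laplace` gives the add-one smoothing discussed below, so unseen continuations get a small nonzero probability instead of zero.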

Edit

As for the smoothing mentioned in the comment, go with Laplace (add-one) smoothing.

Assume that you saw each n-gram one more time than you really did.

So, for example, now we will have:

eats bone (2)
eats banana (2)
eats chocolate (1)
eats dog (1)

Of course, in this case we have a very small dataset, but for a bigger dataset you would get something like:

eats bone (104)
eats banana (1031)
eats chocolate (1)
eats dog (3)
...
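The add-one adjustment can be sketched like this (a toy illustration — the counts and vocabulary below are made up, and `laplace_prob` is just a hypothetical helper name):

```python
from collections import Counter

# Observed bigram counts (before smoothing).
bigram_counts = Counter({("eats", "bone"): 1, ("eats", "banana"): 1})
vocab = {"monkey", "eats", "banana", "dog", "bone", "chocolate"}

def laplace_prob(word, prev):
    """Add-one smoothed P(word | prev): pretend every pair was seen once more."""
    seen = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return (bigram_counts[(prev, word)] + 1) / (seen + len(vocab))

print(laplace_prob("bone", "eats"))       # (1+1)/(2+6) = 0.25
print(laplace_prob("chocolate", "eats"))  # (0+1)/(2+6) = 0.125, no longer zero
```

Note that smoothing shaves a little probability mass off the seen transitions (bone drops from 0.5 to 0.25 here) to give every unseen one a small share; with a realistic dataset the effect on frequent n-grams is much smaller.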
dfens