
I have a corpus given as a list of strings:

corpus = ["Hello I am Sam", "This is a white desk","I ate cereals", ...]

I want to build a language model on this corpus (preferably using nltk), so that I can get the probability of a word given its preceding context in a sentence. So, my later usage will be to get

P("Sam"| "I am")

in this corpus. I couldn't find what the best way to do this is. How do I train an n-gram model and later query such probabilities?

Thanks!

Cranjis

1 Answer


I would recommend using Markov chains: https://en.wikipedia.org/wiki/Markov_chain

Here is a very trivial example for your reference.

Assume that you are going to analyse 1-grams.

Analysed texts:

monkey eats banana

dog eats bone

unigrams: monkey, eats, banana, dog, bone, BEGIN, END.

Each sentence starts with BEGIN.

Two transitions are possible:

BEGIN->monkey

BEGIN->dog

This means there is a 50% chance that a sentence will begin with monkey.

Now after monkey there is a 100% chance of the transition monkey->eats (because there was no other monkey->* transition in the analysed texts).

Now after eats there is a 50% chance of banana and a 50% chance of bone.

So in general, with this model we can generate the following sentences:

monkey eats banana
monkey eats bone
dog eats bone
dog eats banana

Each of those has a 25% chance of being produced.

Note that bone and banana always transition into END.
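The transition counting described above can be sketched in plain Python (a minimal illustration, not a full language model — the `prob` helper and the BEGIN/END markers are just the names used in this answer):

```python
from collections import Counter, defaultdict

sentences = ["monkey eats banana", "dog eats bone"]

# Count word -> next-word transitions, with BEGIN/END markers.
transitions = defaultdict(Counter)
for s in sentences:
    tokens = ["BEGIN"] + s.split() + ["END"]
    for a, b in zip(tokens, tokens[1:]):
        transitions[a][b] += 1

def prob(word, prev):
    """P(word | prev) as relative transition frequency."""
    total = sum(transitions[prev].values())
    return transitions[prev][word] / total if total else 0.0

print(prob("monkey", "BEGIN"))  # 0.5 -- half the sentences start with monkey
print(prob("eats", "monkey"))   # 1.0 -- monkey is always followed by eats
print(prob("banana", "eats"))   # 0.5 -- eats is followed by banana or bone
```

Generating a sentence is then just a matter of repeatedly sampling the next word from `transitions[current]` until END is reached.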

With bigrams you would just split it into monkey eats -> banana END.

This is just the simplified big picture; I hope it helps.
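Since the question mentions nltk: a minimal sketch using the `nltk.lm` module (assuming a recent NLTK, 3.4+, is installed; the corpus here is the one from the question):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = ["Hello I am Sam", "This is a white desk", "I ate cereals"]
tokenized = [sentence.lower().split() for sentence in corpus]

n = 3  # trigram model, so contexts are up to two words long
train_ngrams, vocab = padded_everygram_pipeline(n, tokenized)

lm = MLE(n)
lm.fit(train_ngrams, vocab)

# P("sam" | "i am") -- score(word, context)
print(lm.score("sam", ["i", "am"]))
```

Swapping `MLE` for `nltk.lm.Laplace` gives the add-one smoothing discussed below, so unseen continuations get a small nonzero probability instead of zero.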

Edit

As for the smoothing mentioned in the comment, go with Laplace (add-one) smoothing.

Assume that you saw each n-gram one more time than you really did.

So, for example, now we will have:

eats bone (2)
eats banana (2)
eats chocolate (1)
eats dog (1)

Of course, in this case we have a very small dataset, but for a bigger dataset you would get something like:

eats bone (104)
eats banana (1031)
eats chocolate (1)
eats dog (3)
...
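The add-one adjustment can be sketched like this (a toy illustration — the counts and vocabulary below are made up, and `laplace_prob` is just a hypothetical helper name):

```python
from collections import Counter

# Observed bigram counts (before smoothing).
bigram_counts = Counter({("eats", "bone"): 1, ("eats", "banana"): 1})
vocab = {"monkey", "eats", "banana", "dog", "bone", "chocolate"}

def laplace_prob(word, prev):
    """Add-one smoothed P(word | prev): pretend every pair was seen once more."""
    seen = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return (bigram_counts[(prev, word)] + 1) / (seen + len(vocab))

print(laplace_prob("bone", "eats"))       # (1+1)/(2+6) = 0.25
print(laplace_prob("chocolate", "eats"))  # (0+1)/(2+6) = 0.125, no longer zero
```

Note that smoothing shaves a little probability mass off the seen transitions (bone drops from 0.5 to 0.25 here) to give every unseen one a small share; with a realistic dataset the effect on frequent n-grams is much smaller.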
dfens