How to create the bigram matrix?

Question

I want to build a matrix of bigram counts from my text. How can I do it? Suggestions that fit my code below would be appreciated.

 import codecs
 from collections import Counter

 import nltk

 # Collect the tokens from every line (not just the last line read)
 token = []
 with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
     for line in file:
         token.extend(line.split())

 spl = 80 * len(token) // 100  # index of the 80/20 train/test split
 train = token[:spl]
 test = token[spl:]
 print(len(test))
 print(len(train))
 cn = Counter(train)
 # Drop the rare words (seen only once) but keep the original word order
 known_words = [word for word in train if cn[word] > 1]

 bigram = nltk.bigrams(known_words)
 frequency = nltk.FreqDist(bigram)
 for f in frequency:
     print(f, frequency[f])

I need something like:

        w1       w2       w3       ...  wn
 w1     n(w1w1)  n(w1w2)  n(w1w3)  ...  n(w1wn)
 w2     n(w2w1)  n(w2w2)  n(w2w3)  ...  n(w2wn)
 w3     .
 .
 .
 wn

The same for all rows and columns.

@bebop Actually, I need a matrix whose rows and columns are the words in the training text and whose entries are the bigram frequencies. — marysd, Jun 09 '15 at 09:30
@bebop I edited my question and added an example above, as you asked :-) — marysd, Jun 09 '15 at 10:59

score 2 · Accepted Answer · answered Jun 09 '15 at 10:59

Since you need a "matrix" indexed by words, use a dictionary-like class. The outer dictionary is keyed by the first word of each bigram. To make it two-dimensional, make it a dictionary of dictionaries: each value is another dictionary whose keys are the second words of the bigrams and whose values are whatever you're tracking (probably the number of occurrences).
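As a rough illustration of the dictionary-of-dictionaries idea, here is a minimal sketch using only the standard library. The toy `tokens` list and the name `table` are made up for the example; in your case the tokens would come from your training text.

```python
from collections import Counter, defaultdict

# Toy token list, standing in for the training tokens (an assumption for this sketch)
tokens = "a b a b c a".split()

# Outer dict: first word of the bigram; inner Counter: second word -> count
table = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    table[first][second] += 1

print(table["a"]["b"])  # how often "a" is followed by "b"
print(table["a"]["c"])  # a pair that never occurs counts as 0
```

Using a `Counter` for the inner dictionary means a lookup of an unseen second word gives 0 instead of raising `KeyError`.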

In the NLTK you can do it quickly with a ConditionalFreqDist():

from nltk.corpus import brown
mybigrams = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))
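The same call works on any token list, not only the Brown corpus. A small sketch with a made-up sentence in place of your training tokens (the variable names here are only illustrative):

```python
import nltk

# Toy token list, standing in for the training tokens (an assumption for this sketch)
tokens = "the cat sat on the mat and the cat slept".split()

# Conditions are the first words; each condition holds a FreqDist of second words
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tokens))

print(cfd["the"]["cat"])  # count of the bigram ("the", "cat")
cfd.tabulate()            # prints the counts as a word-by-word table
```

Here `cfd[w1][w2]` plays the role of the cell n(w1w2) in the table from the question.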

But I recommend you build your bigram table step by step. You'll understand it better, and you need to before you can use it.
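One possible way to do it by hand, and to end up with the dense rows-and-columns layout from the question, is sketched below. It assumes a small vocabulary (the grid has one cell per word pair, so it grows quadratically), and the names `vocab`, `index` and `matrix` are just illustrative.

```python
# Toy token list, standing in for the training tokens (an assumption for this sketch)
tokens = "a b a b c a".split()

# Step 1: fix an order for the vocabulary and map each word to a row/column index
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

# Step 2: start from an all-zero square grid, one row and one column per word
matrix = [[0] * len(vocab) for _ in vocab]

# Step 3: walk over adjacent word pairs and bump the matching cell
for w1, w2 in zip(tokens, tokens[1:]):
    matrix[index[w1]][index[w2]] += 1

# Step 4: print the grid with the words as row and column labels
print(" " * 4 + "".join(f"{w:>4}" for w in vocab))
for word, row in zip(vocab, matrix):
    print(f"{word:<4}" + "".join(f"{n:>4}" for n in row))
```

For a realistic vocabulary most cells will be zero, which is why the dictionary-based table is usually the more practical representation.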
