Counting bigram frequencies in Python

Question

Assume that I have data that looks like this:

['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

I would like to get the number of bigrams that occur only once, so

n1 == ('I', '<s>'), ('I', 'UNK'), ('UNK', '</s>')
len(n1) == 3

and the number of bigrams that occur twice:

n2 == ('<s>', 'I')
len(n2) == 1

I am thinking of storing the first word as sen[i] and the next word as sen[i + 1] but I am not sure if this is the right approach.

Do you have it in this format, or can you convert it to list format? – Aaditya Ura, Oct 06 '17 at 18:37
nltk has a nice FreqDist function that should be pretty useful for this. – Peter, Oct 06 '17 at 18:40
Possible duplicate of [counting n-gram frequency in python nltk](https://stackoverflow.com/questions/14364762/counting-n-gram-frequency-in-python-nltk) – polo, Oct 06 '17 at 18:56
That first line isn't valid Python. Is it supposed to be a string? A list of strings? Something else? Please correct. – altendky, Oct 06 '17 at 19:04
@altendky Sorry. It should loop over a list of lists of strings (corpus). – Alibaba17, Oct 06 '17 at 19:09
@Alibaba17 I'll assume you mean a list of strings given what's there. – altendky, Oct 06 '17 at 20:56

gautamaggarwal · Accepted Answer · 2017-10-06T19:29:20.477

1

Considering your list:

lis = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

Loop over the list to build each bigram as a tuple of adjacent items, and keep a running count of each bigram in a dictionary, like this:

bigram_freq = {}
length = len(lis)
for i in range(length-1):
    bigram = (lis[i], lis[i+1])
    if bigram not in bigram_freq:
        bigram_freq[bigram] = 0
    bigram_freq[bigram] += 1

Now count how many distinct bigrams have frequency 1 and how many have frequency 2, like this:

bigrams_with_frequency_one = 0
bigrams_with_frequency_two = 0
for bigram in bigram_freq:
    if bigram_freq[bigram] == 1:
        bigrams_with_frequency_one += 1
    elif bigram_freq[bigram] == 2:
        bigrams_with_frequency_two += 1

You now have bigrams_with_frequency_one and bigrams_with_frequency_two as your results (3 and 1 for the list above). I hope it helps!
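For comparison, the same counts can be had more compactly with collections.Counter from the standard library. This is only a sketch of an alternative, not the code from the answer above; the name count_of_counts is made up for illustration.

```python
from collections import Counter

lis = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# Pair each token with its successor, then count the pairs.
bigram_freq = Counter(zip(lis, lis[1:]))

# Count how many distinct bigrams occur with each frequency.
# count_of_counts is an illustrative name, not from the original answer.
count_of_counts = Counter(bigram_freq.values())

bigrams_with_frequency_one = count_of_counts[1]
bigrams_with_frequency_two = count_of_counts[2]
print(bigrams_with_frequency_one, bigrams_with_frequency_two)  # 3 1
```

A Counter returns 0 for a missing key, so asking for a frequency that never occurs (say count_of_counts[5]) does not raise a KeyError.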

edited Oct 06 '17 at 19:29

answered Oct 06 '17 at 18:54


So, we just return the length of bigrams_with_frequency_one = [] bigrams_with_frequency_two = [] to get the frequency right? – Alibaba17 Oct 06 '17 at 19:18
Oh sorry, My bad... I did not notice. It was even easier than what I did....I am editing. Please accept my answer if it is useful. :) – gautamaggarwal Oct 06 '17 at 19:28
It was really helpful. Thanks! – Alibaba17 Oct 06 '17 at 22:27

score 0 · Answer 2 · answered Oct 06 '17 at 18:53

You can try this:

my_list = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

# Pair each item with its predecessor (the original used an undefined name `l` here).
bigrams = [(my_list[i-1], my_list[i]) for i in range(1, len(my_list))]
print(bigrams)
# [('<s>', 'I'), ('I', '<s>'), ('<s>', 'I'), ('I', 'UNK'), ('UNK', '</s>')]

d = {}

for c in set(bigrams):
    count = bigrams.count(c)
    d.setdefault(count, []).append(c)

print(d)
# {1: [('I', '<s>'), ('UNK', '</s>'), ('I', 'UNK')], 2: [('<s>', 'I')]}  (order within each list may vary, since sets are unordered)
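The asker mentions in the comments that the real input is a corpus, that is, a list of lists of strings. A rough sketch of how the same grouping might extend to that case is below. The corpus contents and the name bigrams_by_count are invented for illustration, and it assumes that bigrams should not span sentence boundaries.

```python
from collections import Counter, defaultdict

# Hypothetical corpus: a list of tokenized sentences (not from the question).
corpus = [
    ['<s>', 'I', 'UNK', '</s>'],
    ['<s>', 'I', 'am', '</s>'],
]

bigram_freq = Counter()
for sentence in corpus:
    # Counting per sentence keeps bigrams from spanning sentence boundaries.
    bigram_freq.update(zip(sentence, sentence[1:]))

# Group the distinct bigrams by how often they occur.
bigrams_by_count = defaultdict(list)
for bigram, count in bigram_freq.items():
    bigrams_by_count[count].append(bigram)

print(len(bigrams_by_count[1]), len(bigrams_by_count[2]))  # 4 1
```

Counting once with a Counter avoids calling bigrams.count(c) for every distinct bigram, which rescans the whole list each time and can get slow on a large corpus.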
