Assume that I have data that looks like
['<s>', 'I', '<s>', 'I', 'UNK', '</s>']
I would like to get the number of bigrams that occur only once, so
n1 == [('I', '<s>'), ('I', 'UNK'), ('UNK', '</s>')]
len(n1) == 3
and the number of bigrams that occur twice
n2 == [('<s>', 'I')]
len(n2) == 1
I am thinking of storing the first word as sen[i] and the next word as sen[i + 1], but I am not sure whether this is the right approach.
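To make the sen[i] / sen[i + 1] idea concrete, here is a minimal sketch of what I have in mind. It assumes the data is a flat list called sen; the names bigrams and counts are only placeholders I made up, and I am not claiming this is the best way to do it. It pairs each word with the one after it using zip, then counts the pairs with collections.Counter.

```python
from collections import Counter

sen = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# Pair each word with its successor, i.e. (sen[i], sen[i + 1]) for every valid i.
# "bigrams" and "counts" are illustrative names, not part of any library.
bigrams = list(zip(sen, sen[1:]))

# Count how many times each bigram occurs.
counts = Counter(bigrams)

# Keep the bigrams that occur exactly once and exactly twice.
n1 = [bigram for bigram, count in counts.items() if count == 1]
n2 = [bigram for bigram, count in counts.items() if count == 2]

print(len(n1))  # 3
print(len(n2))  # 1
```

Using zip(sen, sen[1:]) avoids manual index handling, so there is no risk of sen[i + 1] running past the end of the list.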