
I've seen plenty of documentation around the web about how the Python NLTK makes it easy to compute bigrams of words.

What about letters?

What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter pairs.

Ultimately I'd like to make some kind of Markov process to generate likely-looking (but fake) words.

isthmuses
    What you can do is simply take your string of words, but have your tokenizer tokenize by letter instead of by word, and then run your bigram model on that letter-token set. – jdotjdot Jan 05 '13 at 08:49
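For reference, here is a minimal sketch of what that comment describes, using NLTK's bigrams() and FreqDist on a letter-level token list (the sample string is just a stand-in for your dictionary text):

import nltk

text = "banana"                     # stand-in for your dictionary text
letter_tokens = list(text)          # "tokenize" by letter instead of by word
freqs = nltk.FreqDist(nltk.bigrams(letter_tokens))
for pair, count in freqs.most_common():
    print(pair, count)              # e.g. ('a', 'n') 2, ('n', 'a') 2, ('b', 'a') 1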

2 Answers


Here is an example (minus the relative frequency step) using Counter from the collections module:

#!/usr/bin/env python

import sys
from collections import Counter
from itertools import islice
from pprint import pprint

def split_every(n, iterable):
    """ yield consecutive, non-overlapping n-character chunks of iterable """
    i = iter(iterable)
    piece = ''.join(islice(i, n))
    while piece:
        yield piece
        piece = ''.join(islice(i, n))

def main(text):
    """ count letter pairs (non-overlapping) in text and return a Counter """
    freqs = Counter()
    for pair in split_every(2, text): # adjust n here for longer n-grams
        freqs[pair] += 1
    return freqs

if __name__ == '__main__':
    with open(sys.argv[1]) as handle:
        freqs = main(handle.read()) 
        pprint(freqs.most_common(10))

Usage:

$ python 14168601.py lorem.txt
[('t ', 32),
 (' e', 20),
 ('or', 18),
 ('at', 16),
 (' a', 14),
 (' i', 14),
 ('re', 14),
 ('e ', 14),
 ('in', 14),
 (' c', 12)]
miku
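Since the example above leaves out the relative-frequency step it mentions, here is one possible way to add it, sketched on a made-up Counter (in the script above you would pass the freqs returned by main()):

from collections import Counter

def relative_freqs(counts):
    """ turn raw bigram counts into relative frequencies """
    total = float(sum(counts.values()))
    return {pair: n / total for pair, n in counts.items()}

counts = Counter({'t ': 32, ' e': 20, 'or': 18})   # made-up subset of counts
print(relative_freqs(counts))   # {'t ': 0.457..., ' e': 0.285..., 'or': 0.257...}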

If bigrams are all you need, you don't need NLTK. You can simply do it as follows:

from collections import Counter
text = "This is some text"
# zip the text with itself shifted by one character to get overlapping pairs
bigrams = Counter(x + y for x, y in zip(*[text[i:] for i in range(2)]))
for bigram, count in bigrams.most_common():
    print(bigram, count)

Output:

is 2
s  2
me 1
om 1
te 1
 t 1
 i 1
e  1
 s 1
hi 1
so 1
ex 1
Th 1
xt 1
vpekar
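To get from bigram counts to the fake-word generation the question ultimately asks about, one possible sketch looks like the following (the helper names, the '^'/'$' start and end markers, and the tiny word list are illustrative assumptions, not from either answer; random.choices needs Python 3.6+):

import random
from collections import Counter, defaultdict

def train(words):
    """ map each letter to a Counter of the letters that follow it;
        '^' and '$' are made-up start/end markers """
    transitions = defaultdict(Counter)
    for word in words:
        padded = '^' + word + '$'
        for a, b in zip(padded, padded[1:]):
            transitions[a][b] += 1
    return transitions

def generate_word(transitions, max_len=12):
    """ walk the transition table from '^' until '$' (or max_len) is hit """
    letter, out = '^', []
    while len(out) < max_len:
        followers = transitions[letter]
        letter = random.choices(list(followers), weights=followers.values())[0]
        if letter == '$':
            break
        out.append(letter)
    return ''.join(out)

words = ["banana", "bandana", "cabana"]   # stand-in for a real dictionary
table = train(words)
print(generate_word(table))               # e.g. 'bana' or 'cabanana'

If you'd rather stay inside NLTK, a ConditionalFreqDist over (letter, next_letter) pairs can play the same role as the defaultdict(Counter) used here.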