Computing TTR on corpus

Asked Mar 13 '18 at 02:56

Active Mar 13 '18 at 06:41

Viewed 684 times

I'm trying to compute the type-token ratio (TTR) for each speaker in the Capitol Words corpus, using lemmas over each speaker's entire vocabulary.

I'm also trying to use a defaultdict to go through each entry and give a TTR value for each speaker. So far I have the code below, but I'm not sure how to fix it so it works...

import nltk
cw = ReadCorpus(root)
from collections import defaultdict 
speaker_TTR = defaultdict(int)
for record in cw:
    total_words = set([])
    N = 0
    text = record['text']
    processed = nlp(text)
    textw = [t.lemma_ for t in processed]
    N += len(textw)
    total_words |= set(textw)
    V = len(total_types)
    TTR = float(V)/float(N)
    speaker_TTR[record['speaker_name']] += 1

print "V = ",V
print "N = ",N
print "TTR = ",TTR

edited Mar 13 '18 at 06:41

smci

asked Mar 13 '18 at 02:56

Gerold

The code is giving me values larger than one, but I need values less than one for TTR. – Gerold Mar 13 '18 at 03:07
[Type-Token Ratio](https://en.wikipedia.org/wiki/Lexical_density), in computational linguistics – smci Mar 13 '18 at 06:32
Please show an example paragraph and example values of TTR. Also **make the code reproducible**, and add the missing imports. – smci Mar 13 '18 at 06:46
You can rewrite this much more cleanly using [collections.Counter](https://docs.python.org/2/library/collections.html#counter-objects) – smci Mar 13 '18 at 06:50
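
For reference, here is a minimal sketch of how the Counter suggestion might be applied. It is an illustration, not a tested fix for the original setup. The function name `ttr_by_speaker` and the toy records are hypothetical. The sketch assumes each record has already been reduced to a (speaker name, list of lemmas) pair, so it does not depend on `ReadCorpus` or `nlp`, which the question does not define. It accumulates counts per speaker across all records instead of resetting them on every record, and it stores the ratio itself rather than adding 1 per record, which is what makes the values in the posted code grow past one.

```python
from collections import Counter, defaultdict


def ttr_by_speaker(records):
    """Map each speaker to (types, tokens, TTR) over all of that speaker's records.

    ASSUMPTION: `records` is an iterable of (speaker_name, lemmas) pairs,
    where `lemmas` is a list of already-lemmatized tokens.
    """
    counts = defaultdict(Counter)
    for speaker, lemmas in records:
        # Accumulate across records; nothing is reset inside the loop.
        counts[speaker].update(lemmas)

    result = {}
    for speaker, counter in counts.items():
        n_tokens = sum(counter.values())  # N: total tokens for this speaker
        n_types = len(counter)            # V: distinct lemmas for this speaker
        ttr = n_types / float(n_tokens) if n_tokens else 0.0
        result[speaker] = (n_types, n_tokens, ttr)
    return result


# Hypothetical toy data standing in for the corpus records.
toy_records = [
    ("Speaker A", ["the", "bill", "be", "good"]),
    ("Speaker B", ["we", "oppose", "the", "bill"]),
    ("Speaker A", ["the", "bill", "pass"]),
]

for speaker, (v, n, ttr) in sorted(ttr_by_speaker(toy_records).items()):
    print("%s: V = %d, N = %d, TTR = %.3f" % (speaker, v, n, ttr))
```

With the setup from the question, the pairs could presumably be built as `(record['speaker_name'], [t.lemma_ for t in nlp(record['text'])])` for each `record` in `cw`. Because the number of types can never exceed the number of tokens, every ratio computed this way falls between 0 and 1.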

0 Answers