I'm trying to compute the type-token ratio (TTR) for each speaker in the Capitol Words corpus, using lemmas and counting over each speaker's entire vocabulary.
I'm also trying to use a defaultdict to loop through each record and end up with a TTR percentage per speaker. So far I have the code below, but I'm not sure how to fix it so that it works:
import nltk
from collections import defaultdict

cw = ReadCorpus(root)
speaker_TTR = defaultdict(int)

for record in cw:
    # both of these are reset for every record
    total_words = set([])
    N = 0
    text = record['text']
    processed = nlp(text)
    textw = [t.lemma_ for t in processed]  # lemmas of this record
    N += len(textw)            # token count
    total_words |= set(textw)  # distinct lemmas (types)
    V = len(total_words)
    TTR = float(V)/float(N)
    speaker_TTR[record['speaker_name']] += 1
    print "V = ",V
    print "N = ",N
    print "TTR = ",TTR
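
One way the per-speaker computation could be structured, as a minimal sketch rather than a definitive fix. The loop above starts its counts over for every record and only adds 1 to `speaker_TTR` per record, so nothing is pooled per speaker. The sketch keeps one token count and one set of types per speaker and divides only after all records have been seen. The function name `speaker_ttr`, the `lemmatize` callable, and the toy records are hypothetical stand-ins; the only thing assumed about the corpus is that each record has `'speaker_name'` and `'text'` keys, as in the loop above.

```python
from collections import defaultdict


def speaker_ttr(records, lemmatize):
    """Return {speaker: TTR}, pooling counts over all of a speaker's records.

    `records` is any iterable of dicts with 'speaker_name' and 'text' keys;
    `lemmatize` is a hypothetical callable mapping a text to a list of lemmas.
    """
    token_counts = defaultdict(int)  # speaker -> N, total lemma tokens
    type_sets = defaultdict(set)     # speaker -> set of distinct lemmas

    for record in records:
        speaker = record['speaker_name']
        lemmas = lemmatize(record['text'])
        token_counts[speaker] += len(lemmas)
        type_sets[speaker].update(lemmas)

    # TTR = V / N, computed once per speaker after all records are seen.
    return {
        speaker: float(len(type_sets[speaker])) / n
        for speaker, n in token_counts.items()
        if n > 0  # skip speakers with no tokens to avoid dividing by zero
    }


# Toy data standing in for the corpus. Whitespace splitting stands in for
# real lemmatization, e.g. lambda text: [t.lemma_ for t in nlp(text)].
toy_records = [
    {'speaker_name': 'A', 'text': 'the bill the bill'},
    {'speaker_name': 'B', 'text': 'we vote today'},
    {'speaker_name': 'A', 'text': 'the vote'},
]
ttr_by_speaker = speaker_ttr(toy_records, lambda text: text.lower().split())
for speaker, ttr in sorted(ttr_by_speaker.items()):
    print("%s: TTR = %.1f%%" % (speaker, 100 * ttr))
```

On the toy data, speaker A has 6 tokens and 3 types (TTR 50.0%) and speaker B has 3 tokens and 3 types (TTR 100.0%). Note that punctuation and whitespace tokens would count toward both N and V unless they are filtered out before counting.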