0

I'm trying to compute TTR of the Capitol Words corpus using lemmas over the entire vocabulary of each speaker.

I'm also trying to have defaultdict shuffle through each entry and then give a TTR percentage per each speaker. So far I have the code above, but not sure how to fix it so it works...

import nltk
cw = ReadCorpus(root)
from collections import defaultdict 
speaker_TTR = defaultdict(int)
for record in cw:
    total_words = set([])
    N = 0
    text = record['text']
    processed = nlp(text)
    textw = [t.lemma_ for t in processed]
    N += len(textw)
    total_words |= set(textw)
    V = len(total_types)
    TTR = float(V)/float(N)
    speaker_TTR[record['speaker_name']] += 1

print "V = ",V
print "N = ",N
print "TTR = ",TTR
smci
  • 32,567
  • 20
  • 113
  • 146
Gerold
  • 11
  • 2
  • The code is giving me values larger than one, but I need values lesser than one for TTR. – Gerold Mar 13 '18 at 03:07
  • [Type-Token Ratio](https://en.wikipedia.org/wiki/Lexical_density), in computational linguistics – smci Mar 13 '18 at 06:32
  • Please show an example paragraph and example values of TTR. Also **make the code reproducible**, and add the missing imports. – smci Mar 13 '18 at 06:46
  • You can rewrite this much more cleanly using [collections.Counter](https://docs.python.org/2/library/collections.html#counter-objects) – smci Mar 13 '18 at 06:50

0 Answers0