I'm trying to run a frequency analysis on a Swahili corpus that I'm compiling. At the moment, this is what I have:
    import os
    import re
    from collections import Counter

    path = 'C:\\Python27\\corpus\\'
    cnt = Counter()
    listing = os.listdir(path)
    for infile in listing:
        print "Currently parsing: " + path + infile
        corpus = open(path + infile, "r")
        for line in corpus:
            for word in line.split(' '):
                # strip whitespace (including the trailing newline) before
                # matching, otherwise words at the end of a line are dropped
                word = word.strip()
                if len(word) >= 2 and re.match("^[A-Za-z]*$", word):
                    cnt[word] += 1
        print "Completed parsing: " + path + infile
        corpus.close()
    for (counter, content) in enumerate(cnt.most_common(1000)):
        print str(counter + 1) + " " + str(content)
So this program iterates over all files in a given path, reads in the text of each file, and displays the 1000 most frequent words. Here's the issue: Swahili is an agglutinative language, which means that infixes, suffixes, and prefixes are added to words to convey things like tense, causation, subjunctive mood, prepositions, etc.
So a verb root like '-fanya', meaning 'to do', could become 'nitakufanya' ('I'm going to do you'). As a result, this frequency list is biased towards connecting words like 'for', 'in', and 'out', which don't take these affixes.
Is there a simple way to look at words like 'nitakufanya' or 'tunafanya' and add the root 'fanya' to the count total?
Some potential things to look at:
- Verb roots will be at the end of the word
- The subject markers at the beginning of a word can be one of the following: 'ni' (I), 'u' (you), 'a' (he/she), 'wa' (they), 'tu' (we), 'm' (you all)
- Subject markers are followed by tense markers which are either: 'na' (present), 'li' (past), 'ta' (future), 'ji' (reflexive), 'nge' (conditional)
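Based only on the marker lists above, a rough sketch of what such a stemmer could look like (the function name `strip_affixes` is made up, and this deliberately ignores object infixes, so the 'ku' in 'nitakufanya' survives stripping and you get 'kufanya' rather than 'fanya'; real Swahili morphology would need more rules):

```python
# Marker lists taken directly from the bullets above; this is a naive
# subject-marker + tense-marker peeler, not a real morphological analyzer.
SUBJECT_MARKERS = ['ni', 'u', 'a', 'wa', 'tu', 'm']
TENSE_MARKERS = ['na', 'li', 'ta', 'ji', 'nge']

def strip_affixes(word):
    """Guess a verb root by peeling one subject marker followed by one
    tense marker off the front of the word; if that pattern isn't
    present, return the word unchanged."""
    # Try longer markers first so 'ni' wins over 'n...' false starts.
    for subj in sorted(SUBJECT_MARKERS, key=len, reverse=True):
        if word.startswith(subj):
            rest = word[len(subj):]
            for tense in sorted(TENSE_MARKERS, key=len, reverse=True):
                if rest.startswith(tense):
                    root = rest[len(tense):]
                    if len(root) >= 2:  # don't strip a word down to nothing
                        return root
    return word
```

In the counting loop you could then do `cnt[strip_affixes(word)] += 1`, possibly alongside counting the surface form, so both 'tunafanya' and 'fanya' contribute to the same tally.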
Thanks