
I'm trying to run a frequency analysis on a Swahili corpus that I'm compiling. At the moment, this is what I have:

import os
from collections import Counter
import re


path = 'C:\\Python27\\corpus\\'
cnt = Counter()
listing = os.listdir(path)
for infile in listing:
    print "Currently parsing: " + path + infile
    corpus = open(path + infile, "r")
    for line in corpus:
        for word in line.split():
            # strip whitespace before testing, so trailing newlines
            # don't sneak into the counts
            word = word.strip()
            if len(word) >= 2 and re.match("^[A-Za-z]+$", word):
                cnt[word] += 1
    corpus.close()
    print "Completed parsing: " + path + infile

for rank, entry in enumerate(cnt.most_common(1000)):
    print str(rank + 1) + " " + str(entry)

So this program iterates over all files in a given path, reads the text of each file, and displays the 1000 most frequent words. Here's the issue: Swahili is an agglutinative language, which means that infixes, suffixes, and prefixes are added to words to convey things like tense, causation, subjunctive mood, prepositions, etc.

So a verb root like '-fanya', meaning 'to do', could become 'nitakufanya' ('I'm going to do you'). As a result, this frequency list is biased towards connecting words like 'for', 'in', and 'out', which don't take these affixes.

Is there a simple way to look at words like 'nitakufanya' or 'tunafanya' and add the root 'fanya' to the count total?

Some potential things to look at:

  1. Verb roots will be at the end of the word
  2. The subject markers at the beginning of a word can be one of the following: 'ni' (I), 'u' (you), 'a' (he/she), 'wa' (they), 'tu' (we), 'm' (you all)
  3. Subject markers are followed by tense markers, which can be one of: 'na' (present), 'li' (past), 'ta' (future), 'ji' (reflexive), 'nge' (conditional)

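A minimal sketch of rules 2 and 3 above, assuming the markers only occur word-initially and in that order (note this is a rough heuristic, not a real morphological analyzer, and any genuine root that happens to begin with one of these letter sequences will be over-stripped):

```python
import re

# Subject markers followed by tense markers, per the rules above.
PREFIX_RE = re.compile(r'^(?:ni|u|a|wa|tu|m)(?:na|li|ta|ji|nge)')

def strip_markers(word):
    # Remove at most one subject+tense prefix, if present.
    return PREFIX_RE.sub('', word, count=1)

print(strip_markers('tunafanya'))  # -> 'fanya'
```

Per rule 1 the root sits at the end, so anything left after stripping the prefix pair is a candidate root (here 'nitakufanya' would still leave the object infix 'ku' behind, giving 'kufanya' rather than 'fanya').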
Thanks

Parseltongue

2 Answers


First do the frequency analysis without worrying about the prefixes, then fix up the prefixes using the frequency list. To make this easier, sort the list by word so that words with the same prefix end up next to each other; this will make even hand-pruning quite easy.
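For instance (with toy counts for illustration), once the Counter is built you can dump it sorted by word to cluster shared prefixes; sorting on the reversed word instead clusters shared endings, which may suit Swahili better since the roots sit at the end of the word:

```python
from collections import Counter

# Toy counts standing in for the real corpus frequencies.
cnt = Counter({'tunafanya': 4, 'nitakufanya': 2, 'fanya': 7, 'kwa': 30})

# Alphabetical order groups words sharing a prefix.
by_prefix = sorted(cnt)

# Sorting on the reversed word groups words sharing an ending (root).
by_root = sorted(cnt, key=lambda w: w[::-1])

for word in by_root:
    print('%5d %s' % (cnt[word], word))
```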

Markus Mikkolainen

You can do:

root_words = [re.sub(
    '^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)',
    '', word) for word in words]

to remove the prefixes from each word, but there is not much you can do if a root word itself starts with one of these sequences.
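A quick check of that substitution, assuming `words` is a plain list of tokens ('amekuja' is left untouched because 'me' is not in the tense list the regex covers):

```python
import re

words = ['nitakufanya', 'tunafanya', 'amekuja', 'fanya']
root_words = [re.sub('^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)', '', w)
              for w in words]
print(root_words)  # ['kufanya', 'fanya', 'amekuja', 'fanya']
```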

D K