I'm trying to process a large number of RTF files in order to a) strip them of their metadata and b) read them into an NLTK corpus for analysis (frequency distributions, POS tagging, and LDA topic modeling). I have two pieces of working code, but I'd like to combine them and am having difficulty doing so.
This strips RTF:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

# Parse the RTF file into pyth's document model, then render it as plain text
with open('sample.rtf', 'rb') as f:
    doc = Rtf15Reader.read(f)
print PlaintextWriter.write(doc).getvalue()
This creates a corpus:
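import os
import nltk

corpusdn = '/Users/example/'
dncorpus = nltk.corpus.PlaintextCorpusReader(corpusdn, '.*')
dn = []
for infile in sorted(dncorpus.fileids()):
    # fileids are relative to the corpus root, so join them with corpusdn
    with open(os.path.join(corpusdn, infile), 'r') as f:
        dn.append(f.read())
    print infile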
I have too many files to realistically strip them by hand, so I'd like to combine the two commands but can't figure out how to do it. (Granted, I am a Python newbie.) Any tips would be welcome.
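To make the goal concrete, here is a rough sketch of the shape I think the combined step would take: loop over a directory, strip each RTF file, and write a plain-text copy that the corpus reader can then pick up. I have not run this against my real files. The names rtf_to_text and convert_dir and the separate output directory are placeholders I made up, and I'm assuming pyth behaves the same on every file as it does in the single-file snippet above.

```python
import io
import os


def rtf_to_text(path):
    """Strip one RTF file to plain text with pyth (same calls as the snippet above)."""
    # Imported lazily so the directory loop below can be tried without pyth installed.
    from pyth.plugins.rtf15.reader import Rtf15Reader
    from pyth.plugins.plaintext.writer import PlaintextWriter
    with open(path, 'rb') as f:
        doc = Rtf15Reader.read(f)
    return PlaintextWriter.write(doc).getvalue()


def convert_dir(src_dir, dst_dir, to_text=rtf_to_text):
    """Write a .txt copy of every .rtf file in src_dir into dst_dir (hypothetical helper)."""
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != '.rtf':
            continue  # leave anything that is not RTF alone
        text = to_text(os.path.join(src_dir, name))
        if isinstance(text, bytes):
            # Assumption: if the converter returns bytes, they are UTF-8 encoded
            text = text.decode('utf-8')
        out_name = base + '.txt'
        with io.open(os.path.join(dst_dir, out_name), 'w', encoding='utf-8') as out:
            out.write(text)
        written.append(out_name)
    return written
```

With something like this, I imagine calling convert_dir on my RTF folder once and then pointing nltk.corpus.PlaintextCorpusReader at the output directory with a pattern such as r'.*\.txt', instead of at the raw RTF files.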