How do I batch convert RTF files for NLTK processing?

Asked Dec 01 '14 at 03:23

Active Dec 01 '14 at 04:52

Viewed 313 times

I'm trying to batch-process a large number of RTF files in order to a) strip them of their metadata and b) read them into an NLTK corpus for analysis (frequency distributions, POS tagging, and LDA topic modeling). I have two working pieces of code, but I'm having difficulty combining them.

This strips RTF:

from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

doc = Rtf15Reader.read(open('sample.rtf'))

print PlaintextWriter.write(doc).getvalue()

This creates a corpus:

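import os
import nltk

corpusdn = '/Users/example/'
dncorpus = nltk.corpus.PlaintextCorpusReader(corpusdn, '.*')
dn = []

for infile in sorted(dncorpus.fileids()):
    # fileids are relative to the corpus root, so join them with it
    with open(os.path.join(corpusdn, infile), 'r') as f:
        dn.append(f.read())
    print infile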

I have too many files to realistically strip them by hand, so I'd like to combine the two commands but can't figure out how to do it. (Granted, I am a Python newbie.) Any tips would be welcome.
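One way the two snippets might be combined is sketched below: loop over a directory, convert each `.rtf` file with the same pyth calls as above, and write the result as a `.txt` file that a corpus reader can pick up afterwards. The function names `rtf_to_text` and `convert_directory`, the `convert` parameter, and opening the RTF file in binary mode are my own assumptions for illustration, not anything pyth or NLTK prescribes, and I have not checked this against real RTF files.

```python
import os


def rtf_to_text(path):
    """Convert one RTF file to plain text using the pyth calls from the question."""
    # Imported lazily so the module still loads where pyth is not installed.
    from pyth.plugins.rtf15.reader import Rtf15Reader
    from pyth.plugins.plaintext.writer import PlaintextWriter

    # Binary mode is an assumption; the question opens the file in default mode.
    with open(path, 'rb') as f:
        doc = Rtf15Reader.read(f)
    return PlaintextWriter.write(doc).getvalue()


def convert_directory(src_dir, dst_dir, convert=rtf_to_text):
    """Write a .txt copy of every .rtf file in src_dir into dst_dir.

    `convert` is a hypothetical hook: any callable taking a path and
    returning the extracted text. Returns the sorted list of names written.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)

    written = []
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith('.rtf'):
            continue  # skip anything that is not an RTF file
        text = convert(os.path.join(src_dir, name))
        out_name = os.path.splitext(name)[0] + '.txt'
        # Pick the file mode that matches what the converter returned.
        mode = 'wb' if isinstance(text, bytes) else 'w'
        with open(os.path.join(dst_dir, out_name), mode) as out:
            out.write(text)
        written.append(out_name)
    return written
```

After something like `convert_directory('/Users/example/rtf', '/Users/example/txt')` (both paths are placeholders), the output directory could be handed to `nltk.corpus.PlaintextCorpusReader` with a pattern such as `r'.*\.txt'`, so the corpus only ever sees the stripped text.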

edited Dec 01 '14 at 04:52

Aacini

asked Dec 01 '14 at 03:23

rmoon

possible duplicate of [Creating a new corpus with NLTK](http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk) – alvas Dec 01 '14 at 14:04
@alvas this question uses some of the code from that question, but I'm asking how to combine an RTF strip command with it. – rmoon Dec 01 '14 at 16:27
Can you copy and paste a sample of your RTF file? – alvas Dec 01 '14 at 18:20

0 Answers