I'm trying to use NLTK to work with the New York Times Annotated Corpus, which contains one XML file per article (in the News Industry Text Format, NITF).
I can parse individual documents without any problem, like so:
from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')
I need to work on the whole corpus though. I tried doing this:
reader = XMLCorpusReader('corpora/nytimes', r'.*')
but this doesn't create a usable reader object. For instance,
len(reader.words())
raises:
raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string
How do I read this corpus into NLTK?
I'm new to NLTK, so any help is greatly appreciated.
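For what it's worth, the error message suggests that `words()` wants a single file ID whenever the reader matches more than one file. If that reading is right, one possible workaround is to loop over `reader.fileids()` and pass each ID in turn. The following is only a sketch under that assumption: the helper name `count_words` and the `r'.*\.xml'` pattern are my own illustrative choices, and I haven't confirmed how this performs on a corpus of this size.

```python
from nltk.corpus.reader import XMLCorpusReader


def count_words(root, pattern=r'.*\.xml'):
    """Hypothetical helper: total the tokens in every XML file under root."""
    # The pattern is matched against file paths relative to root,
    # so files in nested year/month/day folders should be picked up too.
    reader = XMLCorpusReader(root, pattern)
    total = 0
    for fileid in reader.fileids():
        # Pass one file ID at a time instead of calling words() with no argument.
        total += len(reader.words(fileid))
    return total
```

With the corpus laid out as above, the call would presumably be `count_words('corpora/nytimes')`.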