Can NLTK's XMLCorpusReader be used on a multi-file corpus?

Question

I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

raises

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK so any help is greatly appreciated.

score 5 · Accepted Answer · edited Sep 16 '12 at 03:49

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest using Python's glob module. It supports Unix-style pathname pattern expansion.

from glob import glob
texts = glob('nltk_data/corpora/nytimes/*')

That gives you a list of the file names matching the pattern. Then, depending on how many of them you want or need to have open at once, you could do:

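import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes'
for item_path in texts:
    # glob() results already include root; pass a root-relative path in a list (a bare string is treated as a regex)
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])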

As suggested by @waffle paradox, you can also whittle this list of texts down to suit your specific needs.
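
Since the articles in the question sit in nested year/month/day folders, a single-level pattern will not reach them. Here is a minimal sketch of one way around that, assuming Python 3.5+ (where glob accepts recursive=True). The helper names find_xml_fileids and build_reader are made up for illustration, and I have not run this against the actual corpus.

```python
import os
from glob import glob


def find_xml_fileids(root):
    """Return root-relative paths of every .xml file under root (hypothetical helper)."""
    # "**" with recursive=True matches any depth of subdirectories.
    pattern = os.path.join(root, "**", "*.xml")
    paths = glob(pattern, recursive=True)
    # File identifiers are given relative to the corpus root, with forward slashes.
    return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in paths)


def build_reader(root):
    """Build one reader over all the files found; requires NLTK to be installed."""
    from nltk.corpus.reader import XMLCorpusReader
    return XMLCorpusReader(root, find_xml_fileids(root))
```

Calling build_reader('nltk_data/corpora/nytimes') should then give a single reader whose fileids() lists every article.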

NAD · Answer 2 · 2011-07-27T15:27:59.687

Here's the solution based on machine yearning and waffle paradox's comments. Build a list of articles using glob and pass them to XMLCorpusReader as a list:

from glob import glob
import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes_test'
years = glob(root + '/*')
year_months = []
for year in years:
    year_months += glob(year + '/*')
days = []
for year_month in year_months:
    days += glob(year_month + '/*')
articles = []
for day in days:
    articles += glob(day + '/*.xml')
# File identifiers must be relative to the corpus root
file_ids = [os.path.relpath(article, root) for article in articles]
reader = XMLCorpusReader(root, file_ids)
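
One caveat: the traceback in the question ("Expected a single file identifier string") suggests that words() wants one file at a time, so len(reader.words()) may still fail on a multi-file reader. A hedged sketch that loops over the files instead is below. count_words is a hypothetical helper, and it only assumes the reader exposes fileids() and words(fileid).

```python
def count_words(reader):
    """Sum word counts file by file (hypothetical helper)."""
    total = 0
    for fileid in reader.fileids():
        # Ask for one file at a time rather than calling reader.words() with no argument.
        total += len(reader.words(fileid))
    return total
```

With the reader built above, count_words(reader) would stand in for len(reader.words()).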

score 3 · Answer 3 · answered Jul 26 '11 at 23:55

Yes, you can specify multiple files (from: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.xmldocs.XMLCorpusReader-class.html).

The problem, I suspect, is that all your files sit in a directory structure along the lines of corpora/nytimes/year/month/date. XMLCorpusReader does not recursively traverse the directories for you. That is, with your code above, XMLCorpusReader('corpora/nytimes', r'.*'), the reader only sees the XML files directly in corpora/nytimes/ (none, since there are only folders), not those in any subfolders that corpora/nytimes may contain. In addition, you probably meant to match only XML files with your second parameter. Since that parameter is a regular expression rather than a shell glob, the pattern would be r'.*\.xml' rather than *.xml.

I would recommend traversing the folders yourself to build absolute paths (the docs above specify that explicit paths for the fileids parameter will work), or if you have a list of year/month/date combinations available, to use that to your advantage.
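
If you do have the dates, something along these lines could turn them into file identifiers. This is only a sketch: it assumes zero-padded year/month/day folder names, as in the 1987/01/01 path from the question, and fileids_for_days is a hypothetical helper name.

```python
import os
from glob import glob


def fileids_for_days(root, days):
    """Return root-relative .xml paths for (year, month, day) tuples (hypothetical helper)."""
    fileids = []
    for year, month, day in days:
        # Assumed layout: root/YYYY/MM/DD/*.xml
        day_dir = os.path.join(root, "%04d" % year, "%02d" % month, "%02d" % day)
        for path in sorted(glob(os.path.join(day_dir, "*.xml"))):
            # Make each path relative to the corpus root, with forward slashes.
            fileids.append(os.path.relpath(path, root).replace(os.sep, "/"))
    return fileids
```

The resulting list can be passed as the second argument to XMLCorpusReader. Days with no matching folder simply contribute nothing.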

Thanks Waffle Paradox. That's very helpful. – NAD, Jul 27 '11 at 15:09
