I'm trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).

I can parse individual documents with no problem like so:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the whole corpus though. I tried doing this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

but this doesn't create a usable reader object. For instance,

len(reader.words())

raises

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK so any help is greatly appreciated.

NAD

3 Answers

I'm no NLTK expert, so there may be an easier way to do this, but naively I would suggest that you use Python's glob module. It supports Unix-style pathname pattern expansion.

from glob import glob
# The articles sit three levels down (year/month/day), so match the XML files there.
texts = glob('nltk_data/corpora/nytimes/*/*/*/*.xml')

That gives you a list of the paths that match the pattern. Then, depending on how many of them you want or need to have open at once, you could do:

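import os
from nltk.corpus.reader import XMLCorpusReader
for item_path in texts:
    # File ids are relative to the corpus root, so strip the root prefix first.
    fileid = os.path.relpath(item_path, 'nltk_data/corpora/nytimes')
    reader = XMLCorpusReader('nltk_data/corpora/nytimes/', [fileid])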

As suggested by @waffle paradox, you can also whittle this list of texts down to suit your specific needs.

zachguo
machine yearning

Here's the solution based on machine yearning and waffle paradox's comments. Build a list of articles using glob and pass them to XMLCorpusReader as a list:

from glob import glob
import os
from nltk.corpus.reader import XMLCorpusReader

root = 'nltk_data/corpora/nytimes_test'
# Walk the year/month/day directory levels one at a time.
years = glob(root + '/*')
year_months = []
for year in years:
    year_months += glob(year + '/*')
days = []
for year_month in year_months:
    days += glob(year_month + '/*')
articles = []
for day in days:
    articles += glob(day + '/*.xml')
# File ids must be relative to the corpus root, so strip the root prefix.
file_ids = [os.path.relpath(article, root) for article in articles]
reader = XMLCorpusReader(root, file_ids)
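
Once the reader holds many files, words() with no argument still appears to expect exactly one file, which is the TypeError from the question. A minimal sketch of working around that is to ask for one file id at a time. The helper name count_words is my own and not part of NLTK; it only assumes the reader offers fileids() and words(fileid), as XMLCorpusReader does.

```python
def count_words(reader):
    """Total number of word tokens across every file the reader knows about.

    count_words is a hypothetical helper, not an NLTK function. It assumes
    the reader exposes fileids() and words(fileid).
    """
    total = 0
    for fileid in reader.fileids():
        # Pass one file id at a time rather than calling words() with no
        # argument on a multi-file reader.
        total += len(reader.words(fileid))
    return total
```

With the reader built above, count_words(reader) would stand in for len(reader.words()). On the full corpus this parses every article, so expect it to be slow.
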
NAD

Yes, you can specify multiple files (from: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.xmldocs.XMLCorpusReader-class.html).

The problem here is that I suspect all your files are contained in a file structure along the lines of corpora/nytimes/year/month/date. XMLCorpusReader does not recursively traverse the directories for you. That is, with your code above, XMLCorpusReader('corpora/nytimes', r'.*'), XMLCorpusReader only sees the XML files directly in corpora/nytimes/ (none, since there are only folders), not those in any subfolders that corpora/nytimes may contain. In addition, you probably meant to restrict the second parameter to XML files. Note that a string passed there is treated as a regular expression rather than a glob, so the pattern would be r'.*\.xml', not *.xml.

I would recommend traversing the folders yourself to build absolute paths (the docs above specify that explicit paths for the fileids parameter will work), or if you have a list of year/month/date combinations available, to use that to your advantage.
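
A rough sketch of that traversal, using only os.walk from the standard library. The function name collect_xml_fileids is made up for illustration, and I build paths relative to the corpus root (with forward slashes) rather than absolute ones, on the assumption that this is the form the reader expects for file ids.

```python
import os


def collect_xml_fileids(root):
    """Return the paths of all .xml files under root, relative to root.

    collect_xml_fileids is a hypothetical helper, not part of NLTK.
    Paths are returned with forward slashes and in sorted order.
    """
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.xml'):
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, root)
                # Normalise Windows separators to forward slashes.
                fileids.append(rel_path.replace(os.sep, '/'))
    return sorted(fileids)
```

The resulting list could then be handed to the reader as XMLCorpusReader('corpora/nytimes', collect_xml_fileids('corpora/nytimes')). I have not run this against the full corpus.
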

waffle paradox