How to load multiple XML corpus files with NLTK and use them as a whole with the Text class?

Question

Folks, I've put together a set of corpora for NLTK which are basically simple XML files. I can load them just fine like this:

>>> from nltk.corpus import cicero
>>> print cicero.fileids()
['cicero_academica.xml', 'cicero_arati_phaenomena.xml', ...]

Now, I understand XMLCorpusReader won't give me the content of all those XML files at once, because it expects a single XML file to process at a time, right? I tried to bypass this by writing a for loop, putting everything in a list and giving it to XMLCorpusReader, but no luck...

Simply put: how could I load multiple XML corpora with NLTK and run .words() in all of them at once? Working code examples would be good.

It seems that I can't load all the XML files at once and use them with, say, the Text() class to print concordances of a word across ALL the XML files, not just one at a time.

Is there any workaround or real NLTK solution for this? Should I write a magical subclass of XMLCorpusReader that does it? Should I drop XML and go for flat files...?

This is similar to my problem, but so far I think the answers there are not really useful NLTK-wise: Can NLTK's XMLCorpusReader be used on a multi-file corpus?

Sidenote: while considering going for flat files I found this: http://pastebin.com/qRPeqZmR (sounds like what I need, but quite hackish as I was expecting NLTK to solve most of these problems itself...). Still pondering about it though. — caio1982, Apr 16 '12 at 19:31
I've not used NLTK and only just downloaded it to try to help you out with this. But: you're saying you can't just do `cicero.words()` and get the words from all the files? It knows the fileids, but won't read them? Show how you've defined your corpus class, perhaps? — kindall, Apr 16 '12 at 20:02
Just posted an answer to my own question with a working example. Thanks for taking a look at it though :-) — caio1982, Apr 17 '12 at 02:24

score 0 · Accepted Answer · answered Apr 17 '12 at 02:23

Not exactly what I was after but it solved the problem for now. I'll play around with it a bit more, so perhaps this will turn out different later on. Anyway, a small working test :-)

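# CategorizedXMLCorpusReader is not shipped with NLTK; here it is imported from a
# local CatXMLReader module based on the code in this question:
# http://stackoverflow.com/questions/6849600/does-anyone-have-a-categorized-xml-corpus-reader-for-nltk
from CatXMLReader import CategorizedXMLCorpusReader

from nltk.corpus import cicero
from nltk import Text

# Absolute paths of every XML file in the corpus
fileids = cicero.abspaths()
reader = CategorizedXMLCorpusReader('/', fileids, cat_file='cats.txt')
# Build a single Text from the words of all files
words = Text(reader.words(fileids))
# concordance() prints its matches itself and returns None, so no print is needed
words.concordance('et')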
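
For comparison, here is a minimal sketch of doing the same thing with the stock XMLCorpusReader only, assuming it is enough to call words() once per file and join the results. The helper names (words_from_all_files, concordance_over_corpus), the root directory argument and the default file pattern are all made up for illustration and are not part of NLTK.

```python
def words_from_all_files(reader):
    """Concatenate reader.words(fileid) over every file the reader lists."""
    tokens = []
    for fileid in reader.fileids():
        # XMLCorpusReader.words() handles one file per call, so loop over the files
        tokens.extend(reader.words(fileid))
    return tokens


def concordance_over_corpus(root, word, pattern=r'.*\.xml'):
    """Build one Text over all XML files under root and print a concordance."""
    # NLTK is imported here so the helper above can be used without it
    from nltk.corpus.reader import XMLCorpusReader
    from nltk.text import Text

    # 'root' and 'pattern' are assumptions about how the corpus is laid out
    reader = XMLCorpusReader(root, pattern)
    text = Text(words_from_all_files(reader))
    text.concordance(word)  # prints the matches itself
    return text
```

Calling something like concordance_over_corpus('/path/to/cicero', 'et') (the path is a placeholder) should then search across every file at once. The trade-off is that all tokens are held in memory as one list, which is probably fine for a corpus of this size.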
