Folks, I've put together a set of corpora for NLTK which are basically simple XML files. I can load them just fine like this:
>>> from nltk.corpus import cicero
>>> print cicero.fileids()
['cicero_academica.xml', 'cicero_arati_phaenomena.xml', ...]
Now, I understand XMLCorpusReader won't give me the content of all those XML files at once, because it expects to process a single XML file at a time, right? I tried to bypass that by writing a for loop, collecting the file ids in a list and passing the list to XMLCorpusReader, but no luck...
Simply put: how can I load multiple XML corpus files with NLTK and run .words() on all of them at once? Working code examples would be good.
It seems that I can't load all the XML files at once and use them with, say, the Text() class to print concordances of a word across ALL the XML files rather than one file at a time.
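To make the goal concrete, here is a minimal sketch of what I am after. It assumes that reader.words(fileid) works when given a single file id (which seems to be how XMLCorpusReader behaves); the helper names words_across_files and concordance_across_files are made up for illustration and are not part of NLTK.

```python
from itertools import chain


def words_across_files(reader):
    """Return one flat list of tokens from every file the reader lists.

    Assumption: reader.words(fileid) accepts exactly one file id at a time.
    """
    return list(
        chain.from_iterable(reader.words(fid) for fid in reader.fileids())
    )


def concordance_across_files(reader, word):
    """Print concordance lines for `word` over the combined token list."""
    # Imported here so the helper above can be used without NLTK installed.
    from nltk.text import Text
    Text(words_across_files(reader)).concordance(word)
```

With my corpus this would be called as concordance_across_files(cicero, some_word). It builds the whole token list in memory, which is probably fine for a corpus of this size but is not a real reader-level solution.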
Is there a workaround or a proper NLTK solution for this? Should I write a subclass of XMLCorpusReader that handles multiple files? Should I drop XML and go for flat files...?
This is similar to my problem, but so far I think the answers there are not really useful NLTK-wise: Can NLTK's XMLCorpusReader be used on a multi-file corpus?