Python 3.4, parsing GB++ size XML Wikipedia dump files using etree.iterparse. I want to test, within the currently matched <page> element, its <ns> value; depending on that value I then want to export the source XML of the whole <page> element and all its contents, including any elements nested within it, i.e. the XML of a whole article.

I can iterate over the <page> elements and find the ones I want, but all the available functions seem to want to read text/attribute values, whereas I simply want a UTF-8 string copy of the source file's XML code for the complete in-scope <page> element. Is this possible?
A cut-down version of the XML looks like this:
```xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xml:lang="en">
  <page>
    <title>Some Article</title>
    <ns>0</ns>
    <revision>
      <timestamp>2017-07-27T00:59:41Z</timestamp>
      <text xml:space="preserve">some text</text>
    </revision>
  </page>
  <page>
    <title>User:Wonychifans</title>
    <ns>2</ns>
    <revision>
      <text xml:space="preserve">blah blah</text>
    </revision>
  </page>
</mediawiki>
```
The Python code that gets me to the <ns> value test is here:
```python
from lxml import etree

# namespace string for all elements (only one namespace is used in Wikipedia XML dumps)
NAMESPACE = '{http://www.mediawiki.org/xml/export-0.10/}'
ns = {'wiki': 'http://www.mediawiki.org/xml/export-0.10/'}

context = etree.iterparse('src.xml', events=('end',))
for event, elem in context:
    # at the end of parsing each <page> element
    if elem.tag == (NAMESPACE + 'page') and event == 'end':
        tagNs = elem.find('wiki:ns', ns)
        if tagNs is not None:
            nsValue = tagNs.text
            if nsValue == '2':
                # export the current <page>'s XML code -- how?
                pass
```
In this case I'd want to extract the XML code of only the second <page>
element, i.e. a string holding:
```xml
<page>
  <title>User:Wonychifans</title>
  <ns>2</ns>
  <revision>
    <text xml:space="preserve">blah blah</text>
  </revision>
</page>
```
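For reference, here is a self-contained sketch of what I imagine the export step might look like, assuming etree.tostring(elem, encoding='unicode') serializes the whole subtree (I'm not certain this is the idiomatic approach, hence the question). The inline SAMPLE bytes just stand in for the real dump file:

```python
import io
from lxml import etree

NAMESPACE = '{http://www.mediawiki.org/xml/export-0.10/}'
ns = {'wiki': 'http://www.mediawiki.org/xml/export-0.10/'}

# stand-in for the real dump file so the sketch is self-contained
SAMPLE = b'''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xml:lang="en">
<page><title>Some Article</title><ns>0</ns>
<revision><text xml:space="preserve">some text</text></revision></page>
<page><title>User:Wonychifans</title><ns>2</ns>
<revision><text xml:space="preserve">blah blah</text></revision></page>
</mediawiki>'''

pages = []
for event, elem in etree.iterparse(io.BytesIO(SAMPLE), events=('end',)):
    if elem.tag == NAMESPACE + 'page':
        tagNs = elem.find('wiki:ns', ns)
        if tagNs is not None and tagNs.text == '2':
            # serialize the whole <page> subtree back to a string
            pages.append(etree.tostring(elem, encoding='unicode'))
        # drop the processed <page> to keep memory flat on huge dumps
        elem.clear()

print(pages[0])
```

Note the serialized string carries the inherited xmlns declaration on the <page> tag, which I could live with; the open question is whether this round-trips the source bytes faithfully or re-serializes from the parsed tree.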