
Using Python 3.4, I'm parsing multi-gigabyte Wikipedia XML dump files with etree.iterparse. I want to test the <ns> value within the currently matched <page> element and, depending on that value, export the source XML of the whole <page> element and all its contents, including any nested elements, i.e. the XML of a whole article.

I can iterate over the <page> elements and find the ones I want, but all the available functions seem to read text/attribute values, whereas I simply want a UTF-8 string copy of the source file's XML for the complete in-scope <page> element. Is this possible?

A cut-down version of the XML looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xml:lang="en">
  <page>
    <title>Some Article</title>
    <ns>0</ns>
    <revision>
      <timestamp>2017-07-27T00:59:41Z</timestamp>
      <text xml:space="preserve">some text</text>
    </revision>
  </page>
  <page>
    <title>User:Wonychifans</title>
    <ns>2</ns>
    <revision>
      <text xml:space="preserve">blah blah</text>
    </revision>
  </page>
</mediawiki>

The Python code getting me to the <ns> value test is here:

from lxml import etree

# store namespace string for all elements (only one used in Wikipedia XML docs)
NAMESPACE = '{http://www.mediawiki.org/xml/export-0.10/}'
ns = {'wiki' : 'http://www.mediawiki.org/xml/export-0.10/'}

context = etree.iterparse('src.xml', events=('end',))
for event, elem in context:
  # at the end of parsing each <page> element
  if elem.tag == (NAMESPACE+'page') and event == 'end':
    tagNs = elem.find('wiki:ns',ns)
    if tagNs is not None:
      nsValue = tagNs.text
      if nsValue == '2':
        # export the current <page>'s XML code

In this case I'd want to extract the XML code of only the second <page> element, i.e. a string holding:

  <page>
    <title>User:Wonychifans</title>
    <ns>2</ns>
    <revision>
      <text xml:space="preserve">blah blah</text>
    </revision>
  </page>

edit: minor typo and better mark-up

mwra

2 Answers


You can do this by collecting each element's text into a dictionary as you parse:

>>> from lxml import etree
>>> NS = '{http://www.mediawiki.org/xml/export-0.10/}'
>>> mediawiki = etree.iterparse('mediawiki.xml')
>>> page_content = {}
>>> for ev, el in mediawiki:
...     tag = el.tag.replace(NS, '')  # strip the namespace prefix
...     if tag=='page':
...         if page_content.get('ns')=='2':
...             print (page_content)
...         page_content = {}
...     else:
...         page_content[tag] = el.text.strip() if el.text else None
... 
{'title': 'User:Wonychifans', 'ns': '2', 'text': 'blah blah', 'revision': ''}

Because the structure of the output XML is quite simple, there should be no difficulty in constructing it from the dictionary.
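For instance, a minimal sketch of rebuilding the <page> fragment from such a dictionary (assuming only the flat title/ns/text keys shown in the sample; the page_to_xml name is illustrative, not part of the answer above):

from xml.sax.saxutils import escape

def page_to_xml(page_content):
    # Escape the text values so the rebuilt fragment stays well-formed XML.
    title = escape(page_content.get('title', ''))
    ns = escape(page_content.get('ns', ''))
    text = escape(page_content.get('text', ''))
    return ('  <page>\n'
            '    <title>' + title + '</title>\n'
            '    <ns>' + ns + '</ns>\n'
            '    <revision>\n'
            '      <text xml:space="preserve">' + text + '</text>\n'
            '    </revision>\n'
            '  </page>\n')

Other fields (timestamp and so on) would be added the same way.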

Edit: Although this approach requires two passes through the XML file, it could be faster, and it does recover the required XML verbatim.

First, look for the starting lines of the wanted page elements.

>>> from lxml import etree
>>> mediawiki = etree.iterparse('mediawiki.xml', events=("start", "end"))
>>> keep = False
>>> for ev, el in mediawiki:
...     tag = el.tag[1+el.tag.rfind('}'):]  # strip the namespace prefix
...     if ev=='start' and tag=='page':
...         keep=False
...     if ev=='end' and tag=='ns' and el.text=='2':
...         keep=True
...     if ev=='end' and tag=='page' and keep:
...         print (el.sourceline)  # the line on which this <page> starts
... 
10

Then go through the XML again to find the complete page entries using the starting points.

>>> with open('mediawiki.xml') as mediawiki:
...     for _ in range(9):  # skip to just before line 10, where the wanted <page> starts
...         r = next(mediawiki)
...     for line in mediawiki:
...         print (line.strip())
...         if '</page>' in line:
...             break
...         
<page>
<title>User:Wonychifans</title>
<ns>2</ns>
<revision>
<text xml:space="preserve">blah blah</text>
</revision>
</page>
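When there is more than one matching page, the start lines from the first pass can be collected into a list and used to drive the second pass. A minimal sketch, assuming start_lines holds the el.sourceline values printed above (the extract_pages name is illustrative):

def extract_pages(xml_path, start_lines):
    # start_lines: the el.sourceline values gathered in the first pass
    pages = []
    wanted = sorted(start_lines)
    with open(xml_path, encoding='utf-8') as f:
        current = []
        copying = False
        for lineno, line in enumerate(f, start=1):
            if not copying and wanted and lineno == wanted[0]:
                copying = True  # reached the start line of a wanted <page>
                wanted.pop(0)
            if copying:
                current.append(line)
                if '</page>' in line:
                    pages.append(''.join(current))
                    current = []
                    copying = False
    return pages

The returned strings can then be written straight to an output file.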
Bill Bell
  • Thanks but this doesn't help that much as I'd simplified the XML example and having to re-build the XML will make the whole process slower/more complex than imagined (plus I've no idea how to do that). – mwra Nov 12 '17 at 22:52
  • Thanks. Surely the second bit of code needs to load the whole XML file and that is 65GB in this case. That is why I'm using iterparse (more correctly, I've successfully used it for other tasks on c.10GB files of this type on an 8GB RAM MacBook Air). The other issue is, I think, that the code assumes all `<page>` elements are the same length, so I think I need to iterate the list of `el.sourceline` values from the first pass. – mwra Nov 13 '17 at 18:29
  • No, the second bit of code reads the XML file line by line. It doesn't assume that the pages are the same length either. Notice that it skips over the first nine lines of the XML file and then outputs lines until it finds one that contains `</page>`, whereupon it stops outputting lines. What might be confusing is that you will need to write code that keeps a count of lines read. In a loop it will skip until it reaches a starting point, then output until it sees `</page>`, and continue with the beginning of the loop. – Bill Bell Nov 14 '17 at 15:47
  • I eventually got to a working solution but want to flag this as the accepted answer as it got me to that point. – mwra Nov 28 '17 at 14:23

I've marked Bill Bell's answer as accepted as it was instrumental in getting me to my final solution, the core of which is below. The outer loop lets me loop through over 50 source XML files.

As some sources are large, the code checks in-loop whether the copied source data exceeds 1GB. If so, the data is written to file and the buffer string variable is purged. Otherwise, all extracted data is written at the end of reading each source file.

Further polish would be to monitor the size of the output file and switch to a new output file once a given size was exceeded. In this case, it was easier to only scan part of the whole source set per run of the script.
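For what it's worth, a rough sketch of that rotation idea (the part numbering, helper names, and 10GB limit here are my own assumptions, not part of the script below):

import os

MAX_OUTPUT_BYTES = 10 * 1024 ** 3  # assumed per-file limit, e.g. 10GB

def output_path(outputDataStr, nameSpaceValue, partNum):
    # Build the path for the current output part, e.g. .../ns4-1.xml
    return outputDataStr + 'ns' + str(nameSpaceValue) + '-' + str(partNum) + '.xml'

def rotate_if_needed(path, outputDataStr, nameSpaceValue, partNum):
    # Switch to the next numbered output file once the current one grows too large.
    if os.path.exists(path) and os.path.getsize(path) > MAX_OUTPUT_BYTES:
        partNum += 1
        path = output_path(outputDataStr, nameSpaceValue, partNum)
    return path, partNum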

I've removed some logging & print statements for brevity:


import sys

dataSourceStr = '/Users/x/WP-data/'
outputDataStr = '/Users/x/WP-data/ns-data/'
headfile = open("header.txt","r")
headStr = headfile.read()
headfile.close()
footStr = '</mediawiki>'
matchCount = 0
strPage = headStr  # start the output buffer with the <mediawiki> header
nameSpaceValue = 4
startNum = 41  # starting file number
lastNum = 53  # ending file number
endNum = lastNum + 1
outputDataFile = outputDataStr + 'ns' + str(nameSpaceValue) + '.xml'

for fileNum in range (startNum , endNum):
  with open(dataSourceStr + str(fileNum) + '.xml') as mediawiki:
    lineNum = 44  # lines of preamble to skip at the top of each source file
    blnKeep = False
    strItem = ''  # the <page> block currently being copied
    loopMatchCount = 0
    for _ in range(lineNum):
      r = next(mediawiki)
    for line in mediawiki:
      if '<ns>' + str(nameSpaceValue) + '</ns>' in line:
        blnKeep = True
        matchCount = matchCount + 1
        loopMatchCount = loopMatchCount + 1
      strItem = strItem + line
      lineNum = lineNum + 1
      if '</page>' in line:
        if blnKeep:
          strPage = strPage + strItem
          strItem = ''
          blnKeep = False
          strPageSize = sys.getsizeof(strPage)
          if strPageSize > 1073741824:  # flush the buffer once it exceeds 1GB
            file = open(outputDataFile,"a")
            file.write(strPage)
            file.close()
            strPage = ''
        else:
          strItem = ''

  # write whatever remains in the buffer for this source file, then reset it
  file = open(outputDataFile,"a")
  file.write(strPage)
  file.close()
  strPage = ''

file = open(outputDataFile,"a")
file.write(footStr)
file.close()

I'm sure this could be more elegant but I hope this helps any fellow non-experts arriving here and trying to do this sort of thing.

mwra