I want to parse a huge XML file. The records in this file look, for example, like record_1 through record_n below; in general the file has this structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
record_1
...
record_n
</dblp>
I wrote some code that is supposed to extract a selection of records from this file.
When I run it (it takes nearly 50 minutes, including storage in the MySQL database), I notice that there is a record which seems to have nearly a million authors. This must be wrong. I even checked it by looking into the file to make sure the file has no errors in it. The paper in question has only 5 or 6 authors, so all is fine with dblp.xml. I therefore assume a logic error in my code, but I can't figure out where it could be. Perhaps someone can tell me where the error is?
The code stops at the line "if len(auth) > 200".
import sys
import MySQLdb
from lxml import etree

elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
tags = ["author", "title", "booktitle", "year", "journal"]

def fast_iter(context, cursor):
    mydict = {}  # represents a paper with all its tags.
    auth = []    # a list of authors who have written the paper "together".
    counter = 0  # counts the papers
    for event, elem in context:
        if elem.tag in elements and event == "start":
            mydict["element"] = elem.tag
            mydict["mdate"] = elem.get("mdate")
            mydict["key"] = elem.get("key")
        elif elem.tag == "title" and elem.text != None:
            mydict["title"] = elem.text
        elif elem.tag == "booktitle" and elem.text != None:
            mydict["booktitle"] = elem.text
        elif elem.tag == "year" and elem.text != None:
            mydict["year"] = elem.text
        elif elem.tag == "journal" and elem.text != None:
            mydict["journal"] = elem.text
        elif elem.tag == "author" and elem.text != None:
            auth.append(elem.text)
        elif event == "end" and elem.tag in elements:
            counter += 1
            print counter
            #populate_database(mydict, auth, cursor)
            mydict.clear()
            auth = []
            if mydict or auth:
                sys.exit("Program aborted because auth or mydict was not deleted properly!")
        if len(auth) > 200:  # There are up to ~150 authors per paper.
            sys.exit("auth: It seems there is a paper which has too many authors!")
        if len(mydict) > 50:  # A paper can have much metadata.
            sys.exit("mydict: It seems there is a paper which has too many tags.")
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
def main():
cursor = connectToDatabase()
cursor.execute("""SET NAMES utf8""")
context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("start", "end"))
fast_iter(context, cursor)
cursor.close()
if __name__ == '__main__':
main()
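To make the failure mode concrete, here is a minimal, self-contained sketch of what happens with code shaped like the above. It uses the stdlib xml.etree.ElementTree (whose start/end event model matches lxml's for this purpose) and made-up record data, not real dblp entries: iterparse delivers events for every element in the document, so author tags inside record types that never match the elements list (for instance the www homepage records in dblp) still get appended to auth, which is only reset when a listed element ends.

```python
# Minimal sketch (hypothetical data, stdlib parser) of how `auth` can
# accumulate authors across records that never match `elements`.
import xml.etree.ElementTree as ET
from io import BytesIO

xml = b"""<dblp>
  <www key="homepages/x"><author>A</author><author>B</author></www>
  <www key="homepages/y"><author>C</author></www>
  <article key="journals/z"><author>D</author><title>T</title></article>
</dblp>"""

elements = ['article', 'inproceedings', 'proceedings', 'book', 'incollection']
auth = []
collected = []
for event, elem in ET.iterparse(BytesIO(xml), events=("start", "end")):
    if event == "end" and elem.tag == "author" and elem.text is not None:
        auth.append(elem.text)
    elif event == "end" and elem.tag in elements:
        # auth is only reset here -- authors A, B, C from the two skipped
        # <www> records have already leaked into this article's list.
        collected.append((elem.get("key"), list(auth)))
        auth = []

print(collected)  # [('journals/z', ['A', 'B', 'C', 'D'])]
```

With nearly a million skipped records in a row, the same leak makes the next wanted record appear to have nearly a million authors.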
EDIT:
I was totally misguided when I wrote this function. I made a huge mistake by overlooking that, while trying to skip some unwanted records, the code got mixed up with some wanted records. And at a certain point in the file, where I skipped nearly a million records in a row, the following wanted record got blown up.
With the help of John and Paul I managed to rewrite my code. It is parsing right now and seems to be doing it well. I'll report back if some unexpected errors remain unsolved. Otherwise, thank you all for your help! I really appreciate it!
def fast_iter2(context, cursor):
    elements = set([
        'article', 'inproceedings', 'proceedings', 'book', 'incollection',
        'phdthesis', "mastersthesis", "www"
    ])
    childElements = set(["title", "booktitle", "year", "journal", "ee"])
    paper = {}    # represents a paper with all its tags.
    authors = []  # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "mastersthesis", "www"]:
                pass
            else:
                populate_database(paper, authors, cursor)
                paperCounter += 1
                print paperCounter
            paper = {}
            authors = []
            # if paperCounter == 100:
            #     break
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    del context
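One caveat worth noting: fast_iter2 never inspects event, so the iterparse context fed to it should request only "end" events. With both "start" and "end" requested (as in main() above), every element is delivered twice, and a listed record would be processed on its opening tag, before its children have been seen. A tiny stdlib sketch (toy data, xml.etree.ElementTree instead of lxml) of the double delivery:

```python
# Sketch: requesting both events doubles the number of (event, element)
# pairs; "end"-only delivers each element exactly once, after its children.
import xml.etree.ElementTree as ET
from io import BytesIO

xml = b"<dblp><article><author>A</author></article></dblp>"

both = [(ev, el.tag) for ev, el in ET.iterparse(BytesIO(xml), events=("start", "end"))]
end_only = [el.tag for ev, el in ET.iterparse(BytesIO(xml), events=("end",))]

print(len(both), end_only)  # 6 ['author', 'article', 'dblp']
```

The "end"-only ordering (innermost elements first) also guarantees that element.text and all child authors are complete by the time the record element itself is handled.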