
I am parsing XML files on a Linux (Ubuntu) machine using a Python script and the cElementTree package. After a while (at the same point every time) the script crashes with the error

Segmentation fault (core dumped)

This seems to be a C-level error, so I think it is connected to the C library I am using (cElementTree). However, I am a bit stuck on how to debug this. If I run the same program on my local MacBook, it works without any problem; it only crashes on the Linux server. How can I debug this? Does anybody know of problems with cElementTree on Linux?

Here is my code:

import gzip
import xml.etree.cElementTree as ET

def fill_pubmed_papers_table(list_of_files):
    for f in list_of_files:
        print("read file %s" % f)
        inF = gzip.open(f, 'rb')
        tree = ET.parse(inF)
        inF.close()
        root = tree.getroot()
        # collect the article elements, then detach them from the root
        papers = root.findall('PubmedArticle')
        root.clear()
        for citation in papers:
            write_to_db(citation)

The write_to_db() function that does the actual parsing is fairly long, but I can make it available if anybody is interested.
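
On the question of how to debug a segfault like this, one option is to ask the interpreter to print the Python traceback when the fatal signal arrives, which at least shows which Python line was executing. The sketch below assumes Python 3.3 or later, where faulthandler is part of the standard library (for Python 2 it exists only as a separate backport package). The helper name enable_crash_traceback and the log path are hypothetical, chosen for illustration.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python traceback of all threads to log_path on a fatal signal.

    enable_crash_traceback is a hypothetical helper name, not part of any library.
    """
    # The file must stay open for the lifetime of the process, because
    # faulthandler writes to its file descriptor at crash time.
    log_file = open(log_path, 'w')
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at the top of the script, before fill_pubmed_papers_table(), should be enough. On Python 3 the same effect is available without code changes by starting the interpreter with -X faulthandler, which writes the traceback to stderr.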

carl

1 Answer


OK, not sure whether this will help anyone, but I found the cause of the segfault. It was not actually connected to cElementTree itself, but to the file being read in. I do not completely understand why this happens, but my code works fine if I delete the papers list at the end of each loop iteration. I changed the code to:

import gc
import gzip
import xml.etree.cElementTree as ET

def fill_pubmed_papers_table(list_of_files):
    for i, f in enumerate(list_of_files):
        print("read file %d named %s" % (i, f))
        inF = gzip.open(f, 'rb')
        tree = ET.parse(inF)
        inF.close()
        root = tree.getroot()
        papers = root.findall('PubmedArticle')
        print("number of papers = %d" % len(papers))
        # we don't need anything from root anymore
        root.clear()
        for citation in papers:
            write_to_db(citation)
        # If I do not release memory here I get a segfault on the Linux server
        del papers
        gc.collect()

I also added a call to the garbage collector just in case, but it's not actually needed. Deleting the papers list is what solved the problem, so I guess it has to do with memory(?)
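
If memory really is the cause, a possible alternative is to avoid building the full list of articles in the first place and stream them instead. The following is only a sketch under that assumption, not the code that was run on the server. It uses iterparse from xml.etree.ElementTree (on Python 2, cElementTree provides the same function), and iter_pubmed_articles is a hypothetical helper name. It also assumes, as root.findall('PubmedArticle') above implies, that the articles are direct children of the root element.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_pubmed_articles(path):
    """Yield PubmedArticle elements one at a time from a gzipped XML file.

    iter_pubmed_articles is a hypothetical helper name used for illustration.
    """
    with gzip.open(path, 'rb') as in_f:
        # 'start' events are requested only so that the very first event
        # hands us the root element.
        context = iter(ET.iterparse(in_f, events=('start', 'end')))
        _, root = next(context)
        for event, elem in context:
            if event == 'end' and elem.tag == 'PubmedArticle':
                yield elem
                # Once the caller is done with this article, detach everything
                # parsed so far from the root so it can be freed.
                root.clear()
```

The loop in fill_pubmed_papers_table() would then become "for citation in iter_pubmed_articles(f): write_to_db(citation)". Each element should be fully processed before the next one is requested, since the tree it belonged to is cleared as soon as iteration resumes.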

carl