Cannot pull data from XML files due to differences in format

Question

I have a script that reads a set of XML files, each named like HMDB61152.xml, and pulls them all in using glob. From each file I need to pull a few details, such as the accession, the name, and a list of diseases. To parse each XML file I used xmltodict because I prefer working with lists rather than XML trees, although I may need to change my strategy given the issues I am facing.

I am able to pull name and acc easily since all XML files have it in the same first level of the tree:

import glob
import os
import xmltodict

path = '/Users/me/Downloads/hmdb_metabolites'
for data_file in glob.glob(os.path.join(path, '*.xml')):
    diseases = []
    with open(data_file) as fd:
        doc = xmltodict.parse(fd.read())
    name = doc['metabolite']['name']
    acc = doc['metabolite']['accession']

So basically at this point there are three options for the disease information:

There are multiple disease tags within the diseases tree, i.e. there are two or more diseases for the given accession.
There is one disease within the diseases tree, meaning the accession has only one disease.
There is no disease in the diseases tree at all.

I need to write a loop that can handle all three cases, and that's where I am failing. Here is my approach so far:

    # I get the disease root, which returns True if it has lower level items (one or more disease within diseases)
    # or False if there are no disease within diseases.
    dis_root=doc['metabolite']['diseases']
    if (bool(dis_root)==True):
        dis_init = doc['metabolite']['diseases']['disease']
        if (bool(doc['metabolite']['diseases']['disease'][0]) == True):
            for x in range(0,len(dis_init)):
                diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
        else: 
            diseases.append(doc['metabolite']['diseases']['disease']['name'])

    else:
        diseases=['None']

So the problem is, for the case where there are multiple diseases, I need to pull their names in the following format: doc['metabolite']['diseases']['disease'][x]['name'] for each x in diseases. But for the ones that have only one disease, they have no index at all, so the only way I can pull the name of that one disease is by doing doc['metabolite']['diseases']['disease']['name'].

The script is failing because as soon as it encounters a file with only one disease, it raises a KeyError when it tries to test whether doc['metabolite']['diseases']['disease'][0] == True. If anyone can help me figure this out, or direct me to a more appropriate strategy, that would be great.

score 0 · Answer 1 · answered Dec 04 '16 at 21:45


Try something like

disease = doc['metabolite']['diseases']['disease']
if isinstance(disease, list):
    pass # xmltodict returns a list when the element repeats, so we have multiple entries
else:
    pass # only a single item, which xmltodict returns as one dict.
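
To cover all three cases from the question in one place, the type check can be wrapped in a small helper. This is only a sketch: the function name disease_names_from_dict and the sample dictionaries are made up for illustration, and it assumes xmltodict's usual behaviour of returning None for an empty element, a dict for a single child, and a list for repeated children.

```python
def disease_names_from_dict(doc):
    """Return the disease names from an xmltodict-style mapping (hypothetical helper)."""
    diseases = doc['metabolite'].get('diseases')
    if not diseases:
        # Empty <diseases> element: assumed to be parsed as None
        return ['None']
    entries = diseases.get('disease')
    if entries is None:
        return ['None']
    if not isinstance(entries, list):
        # A single <disease> comes back as one dict; wrap it so the loop is uniform
        entries = [entries]
    return [entry['name'] for entry in entries]


# Made-up stand-ins for the three shapes described in the question
many = {'metabolite': {'diseases': {'disease': [{'name': 'A'}, {'name': 'B'}]}}}
one = {'metabolite': {'diseases': {'disease': {'name': 'A'}}}}
none = {'metabolite': {'diseases': None}}

print(disease_names_from_dict(many))
print(disease_names_from_dict(one))
print(disease_names_from_dict(none))
```

xmltodict.parse also accepts a force_list argument, for example force_list=('disease',), which makes the named element always parse to a list. An empty diseases element would still need the None check.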

answered Dec 04 '16 at 21:45

Simon Callan


score 0 · Answer 2 · answered Dec 04 '16 at 23:29

I found a relatively easy workaround: I simply wrap the loop in try/except, as follows:

try:
    for x in range(0, len(dis_init)):
        diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
except KeyError:
    # A single disease is a dict, not a list, so indexing it with [x] raises KeyError
    diseases.append(doc['metabolite']['diseases']['disease']['name'])
