I have a script that takes a bunch of XML files, all named in the form HMDB61152.xml, and pulls them all in using glob. For each file I need to pull some details, such as accession, name, and a list of diseases. To parse each XML file I used xmltodict because I generally prefer working with lists rather than XML, although I may need to change my strategy due to the issues I am facing.
I am able to pull name and acc easily, since all the XML files have them at the same first level of the tree:
import glob
import os

import xmltodict

path = '/Users/me/Downloads/hmdb_metabolites'
for data_file in glob.glob(os.path.join(path, '*.xml')):
    diseases = []
    with open(data_file) as fd:
        doc = xmltodict.parse(fd.read())
    name = doc['metabolite']['name']
    acc = doc['metabolite']['accession']
So at this point there are three possibilities for the disease information:
- There are multiple disease tags within the diseases tree, i.e. there are two or more diseases for the given accession.
- There is one disease tag within the diseases tree, meaning the accession has only one disease.
- There are no disease tags in the diseases tree at all.
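To make the three cases concrete, here is a rough sketch of what doc['metabolite']['diseases'] looks like in each one. It assumes xmltodict's usual behaviour: repeated tags come back as a list, a lone tag as a single mapping (a dict, or an OrderedDict in older versions), and an empty element as None. The dictionaries are hand-written stand-ins and the disease names are placeholders, not real HMDB data.

```python
# Hand-written stand-ins for doc['metabolite']['diseases'] in each case.
# The disease names are placeholders, not real HMDB data.
multiple = {'disease': [{'name': 'Disease A'}, {'name': 'Disease B'}]}
single = {'disease': {'name': 'Disease A'}}
empty = None  # assumed result for an empty <diseases> element

print(type(multiple['disease']).__name__)  # list
print(type(single['disease']).__name__)    # dict
print(bool(empty))                         # False
```

The difference between the first two shapes is what matters: indexing with [0] works on the list but not on the single mapping.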
I need to write a loop that can handle any of the three cases, and that's where I am failing. Here is my approach so far:
# I get the disease root, which returns True if it has lower level items (one or more disease within diseases)
# or False if there are no disease within diseases.
dis_root = doc['metabolite']['diseases']
if (bool(dis_root) == True):
    dis_init = doc['metabolite']['diseases']['disease']
    if (bool(doc['metabolite']['diseases']['disease'][0]) == True):
        for x in range(0, len(dis_init)):
            diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
    else:
        diseases.append(doc['metabolite']['diseases']['disease']['name'])
else:
    diseases = ['None']
So the problem is this: when there are multiple diseases, I need to pull their names with doc['metabolite']['diseases']['disease'][x]['name'] for each index x. But when there is only one disease, there is no index at all, so the only way I can get the name of that one disease is doc['metabolite']['diseases']['disease']['name'].
The script fails because as soon as it encounters a file with only one disease, it raises a KeyError when it tries to evaluate bool(doc['metabolite']['diseases']['disease'][0]) == True. If anyone can help me figure this out, or direct me to a more appropriate strategy, that would be great.
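One possible direction, sketched under assumptions rather than tested against the real HMDB files, is to normalize the disease entry to a list before looping, so all three cases go through the same code path. The helper name disease_names is hypothetical, and the sample dictionaries are hand-written stand-ins for what xmltodict is assumed to return (a list for repeated tags, a single mapping for a lone tag, None for an empty element).

```python
def disease_names(doc):
    """Return a list of disease names for one parsed metabolite document."""
    # An empty <diseases> element is assumed to parse to None.
    dis_root = doc['metabolite'].get('diseases') or {}
    entries = dis_root.get('disease') or []
    # A lone <disease> is assumed to parse to a mapping, not a list; wrap it.
    if not isinstance(entries, list):
        entries = [entries]
    return [entry['name'] for entry in entries]


# Hand-written stand-ins for the three parsed shapes (placeholder names).
many = {'metabolite': {'diseases': {'disease': [{'name': 'A'}, {'name': 'B'}]}}}
one = {'metabolite': {'diseases': {'disease': {'name': 'A'}}}}
none = {'metabolite': {'diseases': None}}

print(disease_names(many))  # ['A', 'B']
print(disease_names(one))   # ['A']
print(disease_names(none))  # []
```

Inside the loop this would be used as diseases = disease_names(doc) or ['None'] to keep the original placeholder for files with no diseases. xmltodict.parse also accepts a force_list argument, e.g. force_list=('disease',), which may make the isinstance check unnecessary, but that should be checked against the actual files.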