cElementTree to extract data from XML python

Question

I have an XML file whose structure is similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-12-20">
    <drug type="biotech" created="2005-06-13" updated="2017-11-06">
        <drugbank-id primary="true">DB00001</drugbank-id>
        <drugbank-id>BTD00024</drugbank-id>
        <drugbank-id>BIOD00024</drugbank-id>
        <cas-number>138068-37-8</cas-number>
        <name>Lepirudin</name>
    </drug>
    <drug type="biotech" created="2005-06-13" updated="2017-11-06">
        <drugbank-id primary="true">DB00045</drugbank-id>
        <drugbank-id>BTD00054</drugbank-id>
        <drugbank-id>BIOD00054</drugbank-id>
        <cas-number>205923-56-4</cas-number>
        <name>Lyme disease vaccine (recombinant OspA)</name>
    </drug>
</drugbank>

I am trying to use the cElementTree module in Python 3. I would like to extract the name of each drug in this XML, and I have written the following code:

import xml.etree.cElementTree as ET

tree = ET.parse('fulldatabase.xml')
drugbank = tree.getroot()

print(drugbank.tag)

for drug in drugbank:
    print(drug.find('name').text)

The error I get is AttributeError: 'NoneType' object has no attribute 'text'

I have also checked another question, but the answer its OP wrote did not work for me. Is there any way to get the name and cas-number fields out of each drug? I have tried some variations, such as removing findall() from the for loop, but none of them worked either.

eagle · Accepted Answer · 2018-04-11T15:57:19.740

2

Do you need anything besides the name? If not, this will do it. The lookup fails because the file declares a default XML namespace (the xmlns="http://www.drugbank.ca" attribute on <drugbank>), so every element's tag is qualified with that namespace and a bare 'name' matches nothing:

for name in drugbank.iter('{http://www.drugbank.ca}name'):
    print(name.text)

Lepirudin
Lyme disease vaccine (recombinant OspA)

Here's another way to get the elements you need:

for child in drugbank:  # iterate directly; getchildren() was removed in Python 3.9
    print({'cas-number': child.find('{http://www.drugbank.ca}cas-number').text,
           'name': child.find('{http://www.drugbank.ca}name').text})

{'cas-number': '138068-37-8', 'name': 'Lepirudin'}
{'cas-number': '205923-56-4', 'name': 'Lyme disease vaccine (recombinant OspA)'}
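
A variation on the same idea, as a minimal sketch rather than the only way to do it: find(), findall() and findtext() accept a prefix-to-URI mapping, so the namespace URI does not have to be repeated in every path. The prefix `db`, the helper name `extract_drugs` and the inlined `SAMPLE` string (a shortened copy of the XML from the question) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML, inlined so the sketch runs on its own.
SAMPLE = """<drugbank xmlns="http://www.drugbank.ca" version="5.0">
    <drug type="biotech">
        <drugbank-id primary="true">DB00001</drugbank-id>
        <cas-number>138068-37-8</cas-number>
        <name>Lepirudin</name>
    </drug>
    <drug type="biotech">
        <drugbank-id primary="true">DB00045</drugbank-id>
        <cas-number>205923-56-4</cas-number>
        <name>Lyme disease vaccine (recombinant OspA)</name>
    </drug>
</drugbank>"""

# "db" is an arbitrary prefix chosen for this sketch; only the URI has to
# match the xmlns value declared in the file.
NS = {"db": "http://www.drugbank.ca"}


def extract_drugs(root):
    """Return one dict per <drug> child with its name and CAS number."""
    records = []
    for drug in root.findall("db:drug", NS):
        records.append({
            # findtext() returns the default instead of raising when the
            # element is missing, unlike find(...).text.
            "name": drug.findtext("db:name", default=None, namespaces=NS),
            "cas-number": drug.findtext("db:cas-number", default=None, namespaces=NS),
        })
    return records


root = ET.fromstring(SAMPLE)
for record in extract_drugs(root):
    print(record)
```

Using findtext() with a default also sidesteps the AttributeError from the question when a drug happens to lack one of the fields.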

edited Apr 11 '18 at 15:57

answered Apr 11 '18 at 15:24


Yes, there are a few other things in the XML I would like to have. Which is why I posted a shortened XML over here – Sparker0i Apr 11 '18 at 15:31
For [this](https://www.drugbank.ca/drugs/DB01048.xml) example, I can replicate the above to get cas-number, but to get the information in the drug-interactions section (e.g. a list of interacting drug ids), why does neither of these snippets work? `print {'drug interactions':child.find('{http://www.drugbank.ca}drug-interaction/name').text}` (or variations of that line) or `for node in drugbank.findall('.//drug-interactions/drug-interaction/drugbank-id'): print node`? For some tag names (e.g. 'name' could appear multiple times in different contexts), is there a way to be more specific about the tag path? – Slowat_Kela Apr 17 '18 at 21:28
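
On the nested-path follow-up, a likely cause is that every step of an ElementTree path needs the namespace, not only the first one, and that a path is resolved relative to the element it is called on. The sketch below assumes the nesting implied by the paths in the comment (drug-interactions / drug-interaction / drugbank-id). The fragment and its interaction entries are placeholders for illustration, not real DrugBank data, and `interacting_ids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the paths quoted in the comment;
# the interaction entries are placeholders, not real DrugBank data.
SAMPLE = """<drugbank xmlns="http://www.drugbank.ca">
    <drug>
        <drugbank-id primary="true">DB01048</drugbank-id>
        <name>Example drug</name>
        <drug-interactions>
            <drug-interaction>
                <drugbank-id>DB00001</drugbank-id>
                <name>Lepirudin</name>
            </drug-interaction>
            <drug-interaction>
                <drugbank-id>DB00045</drugbank-id>
                <name>Lyme disease vaccine (recombinant OspA)</name>
            </drug-interaction>
        </drug-interactions>
    </drug>
</drugbank>"""

NS = {"db": "http://www.drugbank.ca"}  # "db" is an arbitrary prefix


def interacting_ids(drug):
    """Return the drugbank-id of every interaction listed under one <drug>."""
    # Each path step carries the prefix; an unprefixed step would match nothing.
    path = "db:drug-interactions/db:drug-interaction/db:drugbank-id"
    return [node.text for node in drug.findall(path, NS)]


root = ET.fromstring(SAMPLE)
for drug in root.findall("db:drug", NS):
    # A direct-child path picks the drug's own <name>, not an interaction's.
    print(drug.findtext("db:name", namespaces=NS), interacting_ids(drug))
```

Spelling out the full path from the <drug> element is what distinguishes the drug's own name from the name elements nested inside each interaction.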
