Python : How to navigate XML sub-nodes efficiently?

Question

I am trying to extract certain data points from XML and have tried two options...

Working with XML format using ElementTree
Working with Dictionary using xmltodict

Here's what I have got so far,

Code

# Packages
# --------------------------------------
import xml.etree.ElementTree as ET

# XML Data
# --------------------------------------
message_xml = \
'<ClinicalDocument> \
    <code code="34133-9" displayName="Summarization of Episode Note"/> \
    <title>Care Summary</title> \
    <recordTarget> \
        <patientRole> \
            <id assigningAuthorityName="LOCAL" extension="L123456"/> \
            <id assigningAuthorityName="SSN" extension="788889999"/> \
            <id assigningAuthorityName="GLOBAL" extension="G123456"/> \
            <addr use="HP"> \
                <streetAddressLine>1000 N SOME AVENUE</streetAddressLine> \
                <city>BIG CITY</city> \
                <state>NA</state> \
                <postalCode>12345-1010</postalCode> \
                <country>US</country> \
            </addr> \
            <telecom nullFlavor="NI"/> \
            <patient> \
                <name use="L"> \
                    <given>JANE</given> \
                    <given>JOE</given> \
                    <family>DOE</family> \
                </name> \
            </patient> \
        </patientRole> \
    </recordTarget> \
</ClinicalDocument>'

# Get Tree & Root
# --------------------------------------
tree = ET.ElementTree(ET.fromstring(message_xml))
root = tree.getroot()

# Iterate
# --------------------------------------
for node in root:

    tag = node.tag
    attribute = node.attrib

    # Get ClinicalDocument.code values
    if tag == 'code':
        document_code_code = attribute.get('code')
        document_code_name = attribute.get('displayName')

    else:
        pass

    # Get ClinicalDocument.recordTarget values
    if tag == 'recordTarget':

        for child in node.iter():

            # Multiple <id> tags
            record_target_local = ??
            record_target_ssn = ??
            record_target_global = ??

            # Multiple <given> tags
            record_target_name_first = ??
            record_target_name_middle = ??
            record_target_name_last = ??

    else:
        pass

Expected Output

document_code,document_name,id_local,id_ssn,id_global,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,L123456,788889999,G123456,JANE,JOE,DOE

Acceptable Output

document_code,document_name,id_type,id,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,LOCAL,L123456,JANE,JOE,DOE
34133-9,Summarization of Episode Note,SSN,788889999,JANE,JOE,DOE
34133-9,Summarization of Episode Note,GLOBAL,G123456,JANE,JOE,DOE

Questions

How to efficiently navigate child-nodes with multiple child-nodes under them?
How to handle duplicate tags (ex: <id>, <given>)?

In the code you look up elements in a namespace (`urn:hl7-org:v3`), but the XML document (`message_xml`) does not use any namespaces. — mzjn, Apr 18 '19 at 04:43
@mzjn, Thanks for noticing. The actual document has a namespace, but I cleaned it up before posting to make it easier to read. — WeShall, Apr 18 '19 at 15:02
Are you working with large XML files where memory might be an issue? — Daniel Haley, Apr 19 '19 at 20:16
@DanielHaley : A few Kb(s) to about an Mb, not too big. I don’t think memory should be an issue. — WeShall, Apr 22 '19 at 14:25
@WeShall - What would start a new row in your output? Could there be more than one `recordTarget`? Could there be more than one `ClinicalDocument` (with some other root element)? It would be easy to give you an answer with the exact output from the exact input, but that might not help you when you try to apply it to your real data. If your input and output would really be like that (with just a single row output), that's ok too and I can add an answer with the information you've already given. — Daniel Haley, Apr 22 '19 at 17:29
@DanielHaley : A document will always have one and only one `ClinicalDocument` but **can** have multiple `recordTarget` tags and multiple `id` for each `recordTarget`. Each new `id` starts a new row (ton of duplicates I know but will give end user flexibility to lookup by any id on-hand). I updated the question with what can be a perfectly acceptable output as well. Thanks for looking. — WeShall, Apr 22 '19 at 18:25
@DanielHaley, your answer definitely helped. Got me in right direction. I am trying to implement your answer using lxml & XPath. — WeShall, Apr 23 '19 at 17:36

score 3 · Accepted Answer · answered Apr 22 '19 at 19:09

How to efficiently navigate child-nodes with multiple child-nodes under them?

A good way to navigate XML is with XPath. ElementTree has limited XPath support, but it appears good enough for what you need. If you end up needing to use more complicated XPath, I'd suggest using XPath in lxml.

How to handle duplicate tags (ex: <id>, <given>)?

It depends on what you need to do with those elements. For example, if you want separate rows for each id element, you'd need to iterate over each one (with findall() in ElementTree or xpath() in lxml).
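As a minimal sketch of the iterate-over-each-one case, here is findall() in ElementTree applied to a trimmed copy of the question's XML. The names xml_text and id_rows are just illustrative, and the XML is inlined only to keep the snippet self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document, inlined for a self-contained sketch.
xml_text = """
<ClinicalDocument>
    <recordTarget>
        <patientRole>
            <id assigningAuthorityName="LOCAL" extension="L123456"/>
            <id assigningAuthorityName="SSN" extension="788889999"/>
            <id assigningAuthorityName="GLOBAL" extension="G123456"/>
        </patientRole>
    </recordTarget>
</ClinicalDocument>
"""

root = ET.fromstring(xml_text)

# findall() returns every matching element in document order,
# so each <id> can become its own (type, value) pair.
id_rows = [
    (id_elem.get("assigningAuthorityName"), id_elem.get("extension"))
    for id_elem in root.findall("recordTarget/patientRole/id")
]

for id_type, id_value in id_rows:
    print(id_type, id_value)
```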

If you just want a value (either text or an attribute value), you need to narrow it down to a single element in the XPath.

For example, an id element that has an assigningAuthorityName attribute value equal to LOCAL would be id[@assigningAuthorityName='LOCAL'].
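A small sketch of that attribute predicate with ElementTree's find(). The helper id_extension is hypothetical (not part of any library), and returning an empty string for a missing id is an assumption about what the CSV should contain.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document, inlined for a self-contained sketch.
xml_text = """
<ClinicalDocument>
    <recordTarget>
        <patientRole>
            <id assigningAuthorityName="LOCAL" extension="L123456"/>
            <id assigningAuthorityName="SSN" extension="788889999"/>
        </patientRole>
    </recordTarget>
</ClinicalDocument>
"""

root = ET.fromstring(xml_text)
patient_role = root.find("recordTarget/patientRole")


def id_extension(role, authority):
    """Hypothetical helper: extension of the <id> for `authority`, or "" if absent."""
    elem = role.find(f"id[@assigningAuthorityName='{authority}']")
    # Compare against None explicitly; an Element without children is falsy.
    return elem.get("extension", "") if elem is not None else ""


local = id_extension(patient_role, "LOCAL")
missing = id_extension(patient_role, "OTHER")  # no such id in this document
```

Comparing with `is not None` matters here because an empty element such as `<id .../>` has no children and would evaluate as false in a plain truth test.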

The given element is a little trickier; how can you tell which one is the first name and which is the middle name? The only way I can see is position: the first given (given[1]) is the first name and the second given (given[2]) is the middle name. Are you guaranteed to always have two given elements? If not, find() returns None for the missing one, so you may need to check for None or use try/except to get the needed output.
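One possible way to do that checking, sketched under the assumption that a missing given should simply become an empty field. The function split_given_names is a hypothetical helper, not something from ElementTree.

```python
import xml.etree.ElementTree as ET


def split_given_names(name_elem):
    """Hypothetical helper: return (first, middle) by position, "" when missing."""
    givens = [g.text or "" for g in name_elem.findall("given")]
    first = givens[0] if len(givens) > 0 else ""
    middle = givens[1] if len(givens) > 1 else ""
    return first, middle


# Two <given> elements, as in the question's sample.
two = ET.fromstring(
    "<name><given>JANE</given><given>JOE</given><family>DOE</family></name>"
)
# Only one <given> element: the middle name falls back to "".
one = ET.fromstring("<name><given>JANE</given><family>DOE</family></name>")

print(split_given_names(two))
print(split_given_names(one))
```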

Also, since you're creating csv output, I'd recommend using the csv module; specifically DictWriter.

This will allow you to store the values from the XML in a dict to write rows. You can create new copies of the dict for new rows while maintaining common values (like document_code and document_name).
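To illustrate the copy-per-row idea on its own, here is a minimal sketch that writes to an in-memory buffer instead of a file. The hard-coded values and the names common, fieldnames, and buffer are assumptions for the example only.

```python
import csv
import io

# Values shared by every row (hard-coded here; normally read from the XML).
common = {"document_code": "34133-9",
          "document_name": "Summarization of Episode Note"}
fieldnames = ["document_code", "document_name", "id"]

buffer = io.StringIO()  # stands in for the output file
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()

for id_value in ["L123456", "788889999"]:
    row = dict(common)  # a shallow copy is enough: the values are plain strings
    row["id"] = id_value
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```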

Here's an example that will create a new row for each recordTarget.

XML Input (input.xml)

<ClinicalDocument> 
    <code code="34133-9" displayName="Summarization of Episode Note"/> 
    <title>Care Summary</title> 
    <recordTarget> 
        <patientRole> 
            <id assigningAuthorityName="LOCAL" extension="L123456"/> 
            <id assigningAuthorityName="SSN" extension="788889999"/> 
            <id assigningAuthorityName="GLOBAL" extension="G123456"/> 
            <addr use="HP"> 
                <streetAddressLine>1000 N SOME AVENUE</streetAddressLine> 
                <city>BIG CITY</city> 
                <state>NA</state> 
                <postalCode>12345-1010</postalCode> 
                <country>US</country> 
            </addr> 
            <telecom nullFlavor="NI"/> 
            <patient> 
                <name use="L"> 
                    <given>JANE</given> 
                    <given>JOE</given> 
                    <family>DOE</family> 
                </name> 
            </patient> 
        </patientRole> 
    </recordTarget>
</ClinicalDocument>

Python

import csv
import xml.etree.ElementTree as ET
from copy import deepcopy

values_template = {"document_code": "", "document_name": "", "id_local": "", "id_ssn": "",
                   "id_global": "", "name_first": "", "name_middle": "", "name_last": ""}

with open("output.csv", "w", newline="") as csvfile:
    csvwriter = csv.DictWriter(csvfile, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                               fieldnames=list(values_template))
    csvwriter.writeheader()

    tree = ET.parse('input.xml')

    values_template["document_code"] = tree.find("code").get("code")
    values_template["document_name"] = tree.find("code").get("displayName")

    for target in tree.findall("recordTarget"):

        values = deepcopy(values_template)

        values["id_local"] = target.find("patientRole/id[@assigningAuthorityName='LOCAL']").get("extension")
        values["id_ssn"] = target.find("patientRole/id[@assigningAuthorityName='SSN']").get("extension")
        values["id_global"] = target.find("patientRole/id[@assigningAuthorityName='GLOBAL']").get("extension")
        values["name_first"] = target.find("patientRole/patient/name/given[1]").text
        values["name_middle"] = target.find("patientRole/patient/name/given[2]").text
        values["name_last"] = target.find("patientRole/patient/name/family").text

        csvwriter.writerow(values)

CSV Output (output.csv)

document_code,document_name,id_local,id_ssn,id_global,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,L123456,788889999,G123456,JANE,JOE,DOE

Here's another example that will create a new row for each recordTarget/patientRole/id...

Python

import csv
import xml.etree.ElementTree as ET
from copy import deepcopy

values_template = {"document_code": "", "document_name": "", "id": "",
                   "name_first": "", "name_middle": "", "name_last": ""}

with open("output.csv", "w", newline="") as csvfile:
    csvwriter = csv.DictWriter(csvfile, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                               fieldnames=list(values_template))
    csvwriter.writeheader()

    tree = ET.parse('input.xml')

    values_template["document_code"] = tree.find("code").get("code")
    values_template["document_name"] = tree.find("code").get("displayName")

    for target in tree.findall("recordTarget"):

        values = deepcopy(values_template)

        values["name_first"] = target.find("patientRole/patient/name/given[1]").text
        values["name_middle"] = target.find("patientRole/patient/name/given[2]").text
        values["name_last"] = target.find("patientRole/patient/name/family").text

        for role_id in target.findall("patientRole/id"):
            values["id"] = role_id.get("extension")
            csvwriter.writerow(values)

CSV Output (output.csv)

document_code,document_name,id,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,L123456,JANE,JOE,DOE
34133-9,Summarization of Episode Note,788889999,JANE,JOE,DOE
34133-9,Summarization of Episode Note,G123456,JANE,JOE,DOE
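The question's "Acceptable Output" also has an id_type column. A possible variation that includes it is sketched below; it inlines a trimmed copy of the XML and writes to an in-memory buffer so it runs on its own, and it assumes missing given names should become empty fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of input.xml, inlined for a self-contained sketch.
xml_text = """<ClinicalDocument>
    <code code="34133-9" displayName="Summarization of Episode Note"/>
    <recordTarget>
        <patientRole>
            <id assigningAuthorityName="LOCAL" extension="L123456"/>
            <id assigningAuthorityName="SSN" extension="788889999"/>
            <id assigningAuthorityName="GLOBAL" extension="G123456"/>
            <patient>
                <name use="L">
                    <given>JANE</given>
                    <given>JOE</given>
                    <family>DOE</family>
                </name>
            </patient>
        </patientRole>
    </recordTarget>
</ClinicalDocument>"""

fieldnames = ["document_code", "document_name", "id_type", "id",
              "name_first", "name_middle", "name_last"]

root = ET.fromstring(xml_text)
code = root.find("code")
common = {"document_code": code.get("code"),
          "document_name": code.get("displayName")}

buffer = io.StringIO()  # stands in for output.csv
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()

for target in root.findall("recordTarget"):
    name = target.find("patientRole/patient/name")
    givens = [g.text for g in name.findall("given")]
    # Values shared by every id row of this recordTarget.
    per_target = dict(common,
                      name_first=givens[0] if givens else "",
                      name_middle=givens[1] if len(givens) > 1 else "",
                      name_last=name.findtext("family", default=""))
    # One row per <id>, carrying both the authority name and the extension.
    for role_id in target.findall("patientRole/id"):
        writer.writerow(dict(per_target,
                             id_type=role_id.get("assigningAuthorityName"),
                             id=role_id.get("extension")))

print(buffer.getvalue())
```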

Thanks, this is great. Truly appreciate it. — WeShall, Apr 24 '19 at 00:13
