I want to read an XML metadata file, extract specific parts, and write them to another file. However, I'm stuck at the very first step: parsing the 2 MB metadata XML file.

For testing and debugging purposes, I've narrowed the input file down to the smaller sample XML below.

<?xml version="1.0" encoding="UTF-8"?>
<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1" >
    <Study OID="MyStudy">
        <GlobalVariables>
            <StudyName>MyStudy</StudyName>
            <ProtocolName>MyProtocol</ProtocolName>
        </GlobalVariables>
        <BasicDefinitions>
            <MeasurementUnit OID="MU_CM" Name="cm">
                <Symbol>
                    <TranslatedText>cm</TranslatedText>
                </Symbol>
            </MeasurementUnit>
            <MeasurementUnit OID="MU_KG" Name="kg">
                <Symbol>
                    <TranslatedText>kg</TranslatedText>
                </Symbol>
            </MeasurementUnit>
        </BasicDefinitions>
        <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
            <Protocol>
                <StudyEventRef StudyEventOID="SE_BASELINE" OrderNumber="1" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_3WK" OrderNumber="2" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_6WK" OrderNumber="3" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_9WK" OrderNumber="4" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_12WK" OrderNumber="5" Mandatory="Yes"/>
            </Protocol>
            <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date" SASFieldName="BL_D_VDA" Comment="Visit date" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
                <Question>
                    <TranslatedText>Visit date</TranslatedText>
                </Question>
            </ItemDef>
            <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer" Length="1" SASFieldName="BL_D_MCO" Comment="Medicine code" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
                <Question>
                    <TranslatedText>Medicine code</TranslatedText>
                </Question>
                <CodeListRef CodeListOID="CL_12345"/>
            </ItemDef>
        </MetaDataVersion>
    </Study>
</ODM>

I'm only interested in the ItemDef elements and their attributes, and I'm using xml.etree.ElementTree to parse the XML file. Here is what I've got so far. However, it never reaches the line that prints "-- found ItemDef"; see the code below.

# imports
import os
import xml.etree.ElementTree as ET
import csv
import array as arr

# which file to read
FILE_NAME = "mystudy.xml"
ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}

e = ET.parse(os.path.join(os.getcwd(), FILE_NAME))
root = e.getroot()

# testing to see if it parses anything
print(root.get('Description'))

namespace = "{http://www.cdisc.org/ns/odm/v1.3}"

# none of these seem to work
# col = e.findall('ItemDef')
# col = e.findall('.//ItemDef')
# col = e.findall('(*)ItemDef')
# col = e.findall('{0}ODM/Study/MetaDataVersion/ItemDef'.format(namespace))
col = e.findall('{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

print("start for-loop")
# iterate all
for itemdef in col:
    name = itemdef.get('Name')
    print("-- found ItemDef name=", name)

print("finished for-loop")

As I understand it, you have to specify the namespace correctly, otherwise findall matches nothing, and that is probably the error. I've searched similar questions on stackoverflow.com and tried several things (see the comments in the code), but none of them work.

  • How do I parse the file correctly, and what is going wrong in the code above?
  • Is there a way to debug and test for the correct namespace for each item?
  • Or is there a way to make it ignore the namespace and just read the elements as-is?

1 Answer

Since the search on e already starts at the root element <ODM>, remove ODM from the XPath expression:

col = e.findall('./{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

# Study Metadata
# start for-loop
# -- found ItemDef name= BL_D_VISITDATE
# -- found ItemDef name= BL_D_MEDCODE
# finished for-loop
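
To see why the original path matched nothing, here is a minimal sketch that compares both paths. It assumes a trimmed-down, inline stand-in for the sample file (the SAMPLE string is made up for illustration) and searches from the root element directly:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the sample file (an assumption for illustration).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0">
      <ItemDef OID="I_1" Name="BL_D_VISITDATE"/>
      <ItemDef OID="I_2" Name="BL_D_MEDCODE"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

root = ET.fromstring(SAMPLE)
namespace = "{http://www.cdisc.org/ns/odm/v1.3}"

# The root element is ODM itself, so paths are relative to it.
print(root.tag)

# Looks for an ODM child *inside* ODM, which does not exist.
with_odm = root.findall(
    "{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef".format(namespace))

# Starts at the children of ODM, which is where Study lives.
without_odm = root.findall(
    "{0}Study/{0}MetaDataVersion/{0}ItemDef".format(namespace))

print(len(with_odm), len(without_odm))
```

Printing root.tag is also a quick way to check which namespace URI the parser attached to the root element.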

Even better, use the namespaces argument of findall with the dictionary you defined, which maps the d prefix to the namespace URI:

ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}

col = e.findall('./d:Study/d:MetaDataVersion/d:ItemDef', namespaces=ns)

# SHORT-HAND FOR ANYWHERE SEARCH
col = e.findall('.//d:ItemDef', namespaces=ns)
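
On the other two questions (checking which namespace each element has, and ignoring the namespace altogether), the following is a rough sketch rather than a definitive recipe. It assumes Python 3.8 or later for the {*} wildcard, and the inline SAMPLE string is again a made-up stand-in for the real file:

```python
import io
import xml.etree.ElementTree as ET

# Inline stand-in for the real file (an assumption for illustration).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0">
      <ItemDef OID="I_1" Name="BL_D_VISITDATE"/>
      <ItemDef OID="I_2" Name="BL_D_MEDCODE"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

# Debugging 1: list the prefix -> URI declarations found while parsing.
# The default namespace is reported with an empty prefix.
ns_map = dict(
    node for _, node in ET.iterparse(io.BytesIO(SAMPLE.encode("utf-8")),
                                     events=["start-ns"]))
print(ns_map)

# Debugging 2: print each distinct tag as the parser sees it,
# in "{uri}localname" form.
root = ET.fromstring(SAMPLE)
seen_tags = sorted({elem.tag for elem in root.iter()})
for tag in seen_tags:
    print(tag)

# Ignoring the namespace: {*} matches the tag in any namespace (Python 3.8+).
names = [item.get("Name") for item in root.findall(".//{*}ItemDef")]
print(names)
```

The wildcard is convenient for exploration. The explicit namespaces dictionary is probably the safer choice if the file might contain same-named elements from different namespaces.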
  • Thanks it works, and I can get to the properties `Name` etc. But once you have an `itemdef`, do you have any hint how to get the `TranslatedText` or the `CodeListRef/@CodeListOID` ? – BdR Jun 24 '21 at 11:33
  • Nvm I think I've got it, it's `itemdef.find('d:Question/d:TranslatedText', namespaces=ns).text` and `itemdef.find('d:CodeListRef', namespaces=ns).get('CodeListOID')` thanks again – BdR Jun 24 '21 at 11:43
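
Building on the two comments above, here is a hedged sketch of pulling the nested values out of each ItemDef and writing them out as CSV. Not every ItemDef has a CodeListRef, so find can return None and needs a guard. The column names, the inline SAMPLE string, and the in-memory io.StringIO target are assumptions for illustration; a real script would open an output file instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Inline stand-in for the real file (an assumption for illustration).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0">
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date">
        <Question><TranslatedText>Visit date</TranslatedText></Question>
      </ItemDef>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer">
        <Question><TranslatedText>Medicine code</TranslatedText></Question>
        <CodeListRef CodeListOID="CL_12345"/>
      </ItemDef>
    </MetaDataVersion>
  </Study>
</ODM>"""

ns = {"d": "http://www.cdisc.org/ns/odm/v1.3"}
root = ET.fromstring(SAMPLE)

rows = []
for itemdef in root.findall(".//d:ItemDef", namespaces=ns):
    # findtext returns the default instead of failing when the path is missing.
    question = itemdef.findtext(
        "d:Question/d:TranslatedText", default="", namespaces=ns)
    # find returns None for items without a code list, so guard before .get().
    codelist = itemdef.find("d:CodeListRef", namespaces=ns)
    codelist_oid = codelist.get("CodeListOID") if codelist is not None else ""
    rows.append([itemdef.get("Name"), itemdef.get("DataType"),
                 question.strip(), codelist_oid])

# Hypothetical column layout; written to memory here to keep the sketch
# self-contained.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Name", "DataType", "Question", "CodeListOID"])
writer.writerows(rows)
print(out.getvalue())
```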