I want to read an XML file with metadata, extract specific parts, and write them to another file. However, I'm stuck at the very first step: parsing the 2 MB metadata XML file.
For testing and debugging purposes, I've narrowed the input file down to the smaller sample XML below.
<?xml version="1.0" encoding="UTF-8"?>
<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1" >
  <Study OID="MyStudy">
    <GlobalVariables>
      <StudyName>MyStudy</StudyName>
      <ProtocolName>MyProtocol</ProtocolName>
    </GlobalVariables>
    <BasicDefinitions>
      <MeasurementUnit OID="MU_CM" Name="cm">
        <Symbol>
          <TranslatedText>cm</TranslatedText>
        </Symbol>
      </MeasurementUnit>
      <MeasurementUnit OID="MU_KG" Name="kg">
        <Symbol>
          <TranslatedText>kg</TranslatedText>
        </Symbol>
      </MeasurementUnit>
    </BasicDefinitions>
    <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
      <Protocol>
        <StudyEventRef StudyEventOID="SE_BASELINE" OrderNumber="1" Mandatory="Yes"/>
        <StudyEventRef StudyEventOID="SE_3WK" OrderNumber="2" Mandatory="Yes"/>
        <StudyEventRef StudyEventOID="SE_6WK" OrderNumber="3" Mandatory="Yes"/>
        <StudyEventRef StudyEventOID="SE_9WK" OrderNumber="4" Mandatory="Yes"/>
        <StudyEventRef StudyEventOID="SE_12WK" OrderNumber="5" Mandatory="Yes"/>
      </Protocol>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date" SASFieldName="BL_D_VDA" Comment="Visit date" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
        <Question>
          <TranslatedText>Visit date</TranslatedText>
        </Question>
      </ItemDef>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer" Length="1" SASFieldName="BL_D_MCO" Comment="Medicine code" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
        <Question>
          <TranslatedText>Medicine code</TranslatedText>
        </Question>
        <CodeListRef CodeListOID="CL_12345"/>
      </ItemDef>
    </MetaDataVersion>
  </Study>
</ODM>
I'm only interested in the ItemDef elements and their attributes, and I'm using xml.etree.ElementTree to parse the XML file. Here is what I've got so far; however, it never reaches the line that prints "-- found ItemDef". See the code below.
# which file to read
FILE_NAME = "mystudy.xml"
ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}
# Import the os module
import os
import xml.etree.ElementTree as ET
import csv
import array as arr
e = ET.parse(os.path.join(os.getcwd(), FILE_NAME))
root = e.getroot()
# test whether it parses anything at all
print(root.get('Description'))
namespace = "{http://www.cdisc.org/ns/odm/v1.3}"
# none of this seems to work..
# col = e.findall('ItemDef')
# col = e.findall('.//ItemDef')
# col = e.findall('(*)ItemDef')
# col = e.findall('{0}ODM/Study/MetaDataVersion/ItemDef'.format(namespace))
col = e.findall('{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))
print("start for-loop")
# iterate all
for itemdef in col:
    name = itemdef.get('Name')
    print("-- found ItemDef name=", name)
print("finished for-loop")
As I understand it, you have to specify the namespace correctly, otherwise nothing is matched; that is probably the error. I've searched similar questions on stackoverflow.com and tried several things (see the commented-out lines in the code), but none of them find any ItemDef elements.
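To check which namespace each element actually carries, one option is to print the qualified tag names that ElementTree stores, which have the form {namespace-uri}LocalName. The following is a minimal sketch, not a definitive method; the inline SAMPLE string is a trimmed stand-in for mystudy.xml and the helper split_tag is a hypothetical name introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed inline copy of the sample, standing in for mystudy.xml
SAMPLE = """<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date"/>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

root = ET.fromstring(SAMPLE)


def split_tag(tag):
    """Hypothetical helper: split '{uri}local' into (uri, local).

    The uri is an empty string when the tag has no namespace.
    """
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


# Collect each distinct qualified tag once, in document order
seen_tags = []
for elem in root.iter():
    if elem.tag not in seen_tags:
        seen_tags.append(elem.tag)

# Show the local name next to the namespace it belongs to
for tag in seen_tags:
    uri, local = split_tag(tag)
    print(f"{local:20} namespace={uri}")
```

Because the default namespace declared on ODM is inherited by all unprefixed child elements, every tag in this sample should report the same CDISC namespace URI.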
- How do I parse the file correctly, and what is going wrong in the above code?
- Is there a way to debug and test for the correct namespace for each item?
- Or is there a way to make it ignore the namespace and just read the elements as-is?
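As a point of comparison for these questions, here is a hedged sketch of three lookups that appear to fit this sample. It assumes that the search path is evaluated relative to the root element: ElementTree.findall() behaves like getroot().findall(), so a path that starts with an ODM step looks for an ODM child inside the root ODM and matches nothing. The inline SAMPLE is a trimmed stand-in for mystudy.xml, the constant OC is a name introduced here for illustration, and the {*} wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed inline copy of the sample, standing in for mystudy.xml
SAMPLE = """<ODM Description="Study Metadata"
     xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date"
               OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer"
               OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

# Prefix-to-URI map; the prefix 'd' is arbitrary and only used in the paths below
ns = {"d": "http://www.cdisc.org/ns/odm/v1.3"}
# Illustrative constant: namespaced attributes are stored as '{uri}name'
OC = "{http://www.openclinica.org/ns/odm_ext_v130/v3.1}"

root = ET.fromstring(SAMPLE)

# 1. Explicit path, relative to the root (ODM), so ODM itself is not a step
by_path = root.findall("d:Study/d:MetaDataVersion/d:ItemDef", ns)

# 2. Any depth below the root, still namespace-aware
anywhere = root.findall(".//d:ItemDef", ns)

# 3. Any namespace at any depth (wildcard support assumes Python 3.8+)
any_ns = root.findall(".//{*}ItemDef")

names = [itemdef.get("Name") for itemdef in by_path]

for itemdef in by_path:
    # Plain attributes have no namespace; OpenClinica:FormOIDs needs the full URI
    print("-- found ItemDef name=", itemdef.get("Name"),
          "forms=", itemdef.get(OC + "FormOIDs"))
```

The same findall() calls should work on the tree returned by ET.parse(), since the tree delegates to its root element. Unprefixed attributes such as Name are not in any namespace, which is why itemdef.get('Name') needs no URI while the OpenClinica attribute does.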