I am attempting to use xmltodict to parse XML in the hope of eventually converting it to a more readable table format for others. I have been able to get through most of the XML, but when I come to an element with multiple subelements, I feel I am chasing my tail. My hope is to use pandas with the values I extract from the XML...

Here is a sanitized version of the XML I am attempting to parse:

  <batchConfiguration>
    <batchJob name="BATCHJOB1">
      <className>batchJob1</className>
      <schedule>Y</schedule>
      <interval>300</interval>
      <systemControlled>N</systemControlled>
    </batchJob>
    <batchJob name="BATCHJOB2">
      <params>
        <param name="QueueName1">batchQueue1</param>
      </params>
      <className>batchJob2</className>
      <startTime>02:10:00</startTime>
      <schedule>N</schedule>
      <daysOfTheWeek>YYYYYYY</daysOfTheWeek>
      <systemControlled>N</systemControlled>
    </batchJob>
    <batchJob name="BATCHJOB3">
      <params>
        <param name="ignoreErrors">Y</param>
        <param name="batchSize">1000</param>
      </params>
      <className>classyBatchJob</className>
      <schedule>Y</schedule>
      <interval>90</interval>
      <systemControlled>N</systemControlled>
    </batchJob>
  </batchConfiguration>

My thought was that I could somehow loop through the elements where there are multiple "param" entries. I can return a single "param", but I am stumped when there are multiple. Here is my code to date. It has bits and pieces where I try to figure things out as I go. The XML is read from a file...

import xmltodict as xml
import pprint

#File to parse
fileptr=open(r"FileIRead.xml")

# Show raw XML text file data
raw_file= fileptr.read()
# print(raw_file)

# Create an XML dictionary
xml_dict=xml.parse(raw_file)
pprint.pprint(xml_dict)

xml_dict1=xml.parse(raw_file)['batchConfiguration']['batchJob']
pprint.pprint(xml_dict1)
# pprint.pprint(xml_dict['batchConfiguration']['batchJob'])

# https://docs.python.org/3/tutorial/errors.html

for bJ in xml_dict1:
    bJName=bJ['@name']
    print(f"Name: {bJ['@name']}")
    print(bJName)
    try:
        print(f"Interval: {bJ['interval']}")
    except:
        print("Interval: N/A")
    try:
        print(f"Scheduled: {bJ['schedule']}")
    except:
        print("Scheduled: N/A")
    try:
        print(f"Start Time: {bJ['startTime']}")
    except:
        print("Start Time: N/A")
    try:
        print(f"End Time: {bJ['endTime']}")
    except:
        print("End Time: N/A")
    try:
        # This works fine to return only a single element. With multiple it fails.
        print(f"Params: {bJ['params']['param']['@name']} - {bJ['params']['param']['#text']}")
    except:
        print("Params: N/A")
    try:
        print(f"Classname: {bJ['className']}")
    except:
        print("Classname: N/A")
    try:
        print(f"DaysOfWeek: {bJ['daysOfTheWeek']}")
    except:
        print("DaysOfWeek: N/A")
    try:
        # Attempt to get all parameters single or multiple
        xml_dict2=xml.parse(raw_file)['params']['param']
        pprint.pprint(xml_dict2)
        for bJ1 in xml_dict2['params']['param']:
            print(f"--- {bJ1['@name']}")
    except:
        print("It no worky")

Edit: By request... The output I have been able to get is:

Name: BATCHJOB1
Classname: batchJob1
... (etc)

My end goal is to take the output and put it into column format something like this:

Name            Classname    ...
BATCHJOB1       batchJob1

"N/A" would be placed where the element does not exist or has no value.

gritts

2 Answers


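xmltodict returns a dict when an element occurs only once under its parent, but a list when it occurs two or more times. That is why indexing bJ['params']['param']['@name'] works for a single param and fails for several. The force_list parameter of .parse lets you name keys whose values should always be lists.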

You could use:

xml_dict1 = xml.parse(raw_file, force_list=('param',))['batchConfiguration']['batchJob']

Then:

try:
    for p in bJ['params']['param']:
        print(f"Params: {p['@name']} - {p['#text']}")
except KeyError:  # catch the specific error; avoid a bare 'except'
    print("Params: N/A")
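
If changing the parse call is not convenient, the same dict-versus-list problem can be handled after parsing. The sketch below is only an illustration: as_list and format_params are hypothetical helper names, and the three dicts are hand-written stand-ins for what xmltodict produces without force_list, so nothing here calls the library itself.

```python
def as_list(value):
    """Return value as a list: None -> [], a list -> unchanged, anything else -> [value]."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-ins (assumption) for xmltodict output without force_list.
one_param = {"params": {"param": {"@name": "QueueName1", "#text": "batchQueue1"}}}
two_params = {"params": {"param": [
    {"@name": "ignoreErrors", "#text": "Y"},
    {"@name": "batchSize", "#text": "1000"},
]}}
no_params = {"className": "batchJob1"}


def format_params(job):
    """Join every param of a job as name=value, or return "N/A" when there are none."""
    # "or {}" also covers a params key that is present but empty (None).
    params = as_list((job.get("params") or {}).get("param"))
    if not params:
        return "N/A"
    return "; ".join(f"{p['@name']}={p['#text']}" for p in params)


for job in (one_param, two_params, no_params):
    print("Params:", format_params(job))
```

Using dict.get here plays the same role as the try/except KeyError above: a missing key yields the "N/A" placeholder without hiding unrelated errors.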
Mark Tolonen
  • Thank you, this does output the subelement values. As a side, I was not sure I could leave KeyError blank so I did not add them to my except lines. – gritts Jan 31 '23 at 23:17
  • @gritts Using the error expected makes sure you don't accidentally ignore other errors. For example, `NameError` is raised if you misspell a variable (`bj` vs. `bJ`) but would be silently ignored. – Mark Tolonen Jan 31 '23 at 23:24
  • This answer worked best for me in terms of the information returned. I was able to capture the subelement values along with their corresponding element values. – gritts Feb 06 '23 at 14:59

If I understand you correctly, this can be accomplished by using pandas.read_xml():

import pandas as pd
pd.read_xml([your_xml]).iloc[:,0:2]

Output, based on your sample xml:

      name      className
0   BATCHJOB1   batchJob1
1   BATCHJOB2   batchJob2
2   BATCHJOB3   classyBatchJob
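
A slightly fuller sketch of the same idea, covering the "N/A" placeholders and the CSV export mentioned in the question. It assumes pandas 1.3 or later (where read_xml exists), uses parser="etree" so that lxml is not required, and works on a trimmed-down copy of the sample XML held in a hypothetical xml_text variable. The choice of columns is only for illustration.

```python
from io import StringIO

import pandas as pd

# Trimmed-down copy of the sample XML from the question (assumption: the real
# data would come from a file path passed to read_xml instead).
xml_text = """<batchConfiguration>
  <batchJob name="BATCHJOB1">
    <className>batchJob1</className>
    <schedule>Y</schedule>
    <interval>300</interval>
  </batchJob>
  <batchJob name="BATCHJOB2">
    <className>batchJob2</className>
    <startTime>02:10:00</startTime>
    <schedule>N</schedule>
  </batchJob>
</batchConfiguration>"""

# Each <batchJob> becomes a row; attributes and child elements become columns.
df = pd.read_xml(StringIO(xml_text), parser="etree")

# Select columns by name rather than by position, and mark missing values.
table = df[["name", "className", "startTime"]].fillna("N/A")

# to_csv without a path returns the CSV text.
print(table.to_csv(index=False))
```

Selecting by column name instead of .iloc keeps the result stable if the element order in the XML changes.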
Jack Fleeting
  • Not what I was thinking, honestly didn't know you could do that. I am thinking of using pandas to format the results so that I can export them to a .csv file (eventually). I should say using pandas here was a means to learn more new stuff. Thank you, I will look at this as an option as well. – gritts Jan 31 '23 at 23:15
  • I worked with this solution a bit and prefer how this outputs the XML converted to tables. With further tinkering I was able to return comma-delimited values. I would like this to return the subelements (params) as well but am not sure how to go about that with this approach. – gritts Feb 06 '23 at 15:02
  • @gritts It's probably possible to do that, but per SO policy, you should post it as a separate question. – Jack Fleeting Feb 06 '23 at 21:37