I need a Python program that will merge all XML files in a folder, remove duplicate part IDs, and then output to a new XML file.

The code below was mostly provided to me by someone else, and I do not understand much of its logic. I'm also not sure the statements inside my main function are in the right order. I keep running into the error shown further down in this post: the pd.concat() call raises FileNotFoundError ("No such file or directory") even though os.listdir(input_folder) should be correct.

Could someone please explain why I keep getting this error, and help break down the logic of each line of this code? I would greatly appreciate some help understanding what happens at each step.

import os
import xmltodict
import pandas as pd
from pathlib import Path

input_folder = r"C:\Users\me\OneDriveS\Documents\Data Conversion\ACES & PIES conversion\Sample Files"
output_folder = r"C:\Users\me\OneDrive\Documents\Data Conversion\ACES & PIES conversion\Sample Output\test.xml"


def merge_aces(input_folder, output_folder):
    df = pd.concat(
        (
            pd.json_normalize(xmltodict.parse(Path(file).read_bytes())).assign(
                filename=file
            )
            for file in os.listdir(input_folder)
        ),
        ignore_index=True,
    ).explode("ACES.App")

    df["Aces.App.Part"] = pd.json_normalize(df["ACES.App"]).set_index(df.index)["Part"]

    filename, xmlout = df_to_xmltodict(df.drop_duplicates("Aces.App.Part"))

    Path(filename).write_text(xmlout)


def df_to_xmltodict(df, sep="."):
    result = {"ACES": {"Header": {}, "App": []}}
    for _, row in df.iterrows():
        new_row = {}
        for name, value in row.items():
            keys = name.split(sep)
            count = len(keys) - 1
            parent = new_row
            for index, key in enumerate(keys):
                if index == count:
                    parent[key] = value
                else:
                    parent.setdefault(key, {})
                    parent = parent[key]

        result["ACES"]["App"].append(new_row["ACES"]["App"])

        del new_row["ACES"]["App"]
        del new_row["Aces"]

        filename = new_row.pop("filename", None)

        for key, value in new_row.items():
            result["ACES"].update(value)

    return filename, result


if __name__ == "__main__":
    merge_aces(input_folder, output_folder)
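
As a partial breakdown of df_to_xmltodict: the innermost loop turns one flat row, whose column names are dotted paths such as "ACES.Header.Company" (the naming pd.json_normalize produces), back into nested dictionaries. The sketch below isolates that step. The helper name nest and the sample row are made up for illustration and are not part of the posted code.

```python
def nest(flat, sep="."):
    """Rebuild nested dicts from a flat mapping with dotted keys."""
    nested = {}
    for name, value in flat.items():
        keys = name.split(sep)
        parent = nested
        # Walk (and create) one dict level per key, except for the last key.
        for key in keys[:-1]:
            parent = parent.setdefault(key, {})
        # The last key holds the actual value.
        parent[keys[-1]] = value
    return nested


# Hypothetical row; xmltodict stores XML attributes under an "@" prefix.
row = {"ACES.@version": "4.2", "ACES.Header.Company": "x", "filename": "a.xml"}
nested = nest(row)
# nested == {"ACES": {"@version": "4.2", "Header": {"Company": "x"}}, "filename": "a.xml"}
```

This also seems to explain the `del new_row["Aces"]` line: the helper column is named "Aces.App.Part" with a lowercase "ces", so it nests under a separate "Aces" key that is discarded before the output is built.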

Anyway, I keep getting this error below when I run my program:

  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 58, in <module>
    merge_aces(input_folder, output_folder)
  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 12, in merge_aces
    df = pd.concat(
         ^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\concat.py", line 368, in concat
    op = _Concatenator(
         ^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\concat.py", line 422, in __init__
    objs = list(objs)
           ^^^^^^^^^^
  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 14, in <genexpr>
    pd.json_normalize(xmltodict.parse(Path(file).read_bytes())).assign(
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\pathlib.py", line 1050, in read_bytes
    with self.open(mode='rb') as f:
         ^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'ANH_CSB_ACES_full-022320230223.xml'
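
One likely reading of the last line of the traceback: the path that failed to open is a bare file name with no folder in front of it. os.listdir() returns only the names of the entries, so Path(file) is resolved against the current working directory rather than input_folder. A minimal sketch of one way around this is to build full paths with pathlib. The helper name list_xml_files and the sample file are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def list_xml_files(input_folder):
    """Return the full paths of the .xml files in input_folder, sorted by name."""
    return sorted(Path(input_folder).glob("*.xml"))


# Demonstration in a throwaway folder (the file name is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample_aces.xml").write_text("<ACES/>", encoding="utf-8")

    bare_names = os.listdir(tmp)        # names only, without the folder part
    full_paths = list_xml_files(tmp)    # folder joined with each name
    contents = [path.read_bytes() for path in full_paths]
```

In the posted generator, the equivalent change would be to read from `Path(input_folder) / file` instead of `Path(file)`. The glob pattern has the side benefit of skipping anything in the folder that is not an .xml file.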

Below is an example of one of the XML files with sensitive info removed. Each file contains a large number of these App elements, and each file always has this Header element. I want to retain this structure in the output XML file.

<?xml version="1.0" encoding="utf-8"?>
<ACES version="4.2">
     <Header>
          <Company>x</Company>
          <SenderName>y</SenderName>
          <SenderPhone>z</SenderPhone>
          <TransferDate>a</TransferDate>
          <BrandAAIAID>b</BrandAAIAID>
          <DocumentTitle>c</DocumentTitle>
          <DocFormNumber>2.0</DocFormNumber>
          <EffectiveDate>2023-02-22</EffectiveDate>
          <SubmissionType>FULL</SubmissionType>
          <MapperCompany>d</MapperCompany>
          <MapperContact>e</MapperContact>
          <MapperPhone>f</MapperPhone>
          <MapperEmail>g</MapperEmail>
          <VcdbVersionDate>2023-01-26</VcdbVersionDate>
          <QdbVersionDate>2023-01-26</QdbVersionDate>
          <PcdbVersionDate>2023-01-26</PcdbVersionDate>
     </Header>
     <App action="A" id="1">
          <BaseVehicle id="5911"/>
          <BodyType id="5"/>
          <EngineBase id="560"/>
          <Note>WITHOUT AUTO LEVELING SYSTEM</Note>
          <Qty>1</Qty>
          <PartType id="7600"/>
          <Position id="104"/>
          <Part>701940</Part>
     </App>
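
For comparison, here is a minimal sketch of the same merge using only the standard library's xml.etree.ElementTree, which keeps the Header/App layout shown above without a round trip through pandas. It rests on several assumptions that are not confirmed by the post: every file has an ACES root like the sample, the Header of the first file is the one to keep, the first App seen for each Part value wins, and App elements with no Part are treated as duplicates of one another. The function name merge_aces_files is made up.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_aces_files(input_folder, output_file):
    """Merge all .xml files in input_folder, keeping one App per Part value."""
    merged_root = None
    seen_parts = set()

    for path in sorted(Path(input_folder).glob("*.xml")):
        root = ET.parse(path).getroot()
        if merged_root is None:
            # Assumption: reuse the first file's root attributes and Header.
            merged_root = ET.Element(root.tag, root.attrib)
            header = root.find("Header")
            if header is not None:
                merged_root.append(header)
        for app in root.findall("App"):
            part = app.findtext("Part")
            if part in seen_parts:
                continue  # this part ID was already written
            seen_parts.add(part)
            merged_root.append(app)

    if merged_root is None:
        raise FileNotFoundError(f"no .xml files found in {input_folder}")

    ET.ElementTree(merged_root).write(
        str(output_file), encoding="utf-8", xml_declaration=True
    )
    return len(seen_parts)
```

The output file should live outside the input folder (or be written only after the loop, as here) so that a later run does not pick up the merged file as an input.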
  • Code does not look good to me. `output_folder` is defined as a file and not used inside the method – LMC Mar 07 '23 at 21:23
  • If I understood your question correctly, you want to merge n XML files into one, without duplicate branches. Then you should share two dummy XML files that show your problem and the expected result. If you don't understand the code, then please don't use it. – Hermann12 Mar 07 '23 at 22:00