I need a Python program that will merge all XML files in a folder, remove duplicate part IDs, and then output to a new XML file.
The code below was mostly provided to me by someone else, and I do not understand much of its logic. I'm not sure whether the lines inside my main function are in the correct order. I keep running into the error shown further down in this post: the pd.concat() call fails with "No such file or directory" even though os.listdir(input_folder) should be returning the correct files.
Could someone please explain why I keep getting this error, and help break down the logic of each line of this code? I would greatly appreciate some help understanding what happens at each step.
import os
import xmltodict
import pandas as pd
from pathlib import Path
input_folder = r"C:\Users\me\OneDriveS\Documents\Data Conversion\ACES & PIES conversion\Sample Files"
output_folder = r"C:\Users\me\OneDrive\Documents\Data Conversion\ACES & PIES conversion\Sample Output\test.xml"
def merge_aces(input_folder, output_folder):
    df = pd.concat(
        (
            pd.json_normalize(xmltodict.parse(Path(file).read_bytes())).assign(
                filename=file
            )
            for file in os.listdir(input_folder)
        ),
        ignore_index=True,
    ).explode("ACES.App")
    df["Aces.App.Part"] = pd.json_normalize(df["ACES.App"]).set_index(df.index)["Part"]
    filename, xmlout = df_to_xmltodict(df.drop_duplicates("Aces.App.Part"))
    Path(filename).write_text(xmlout)

def df_to_xmltodict(df, sep="."):
    result = {"ACES": {"Header": {}, "App": []}}
    for _, row in df.iterrows():
        new_row = {}
        for name, value in row.items():
            keys = name.split(sep)
            count = len(keys) - 1
            parent = new_row
            for index, key in enumerate(keys):
                if index == count:
                    parent[key] = value
                else:
                    parent.setdefault(key, {})
                    parent = parent[key]
        result["ACES"]["App"].append(new_row["ACES"]["App"])
        del new_row["ACES"]["App"]
        del new_row["Aces"]
        filename = new_row.pop("filename", None)
        for key, value in new_row.items():
            result["ACES"].update(value)
    return filename, result

if __name__ == "__main__":
    merge_aces(input_folder, output_folder)
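For anyone following the logic: the inner loop of df_to_xmltodict appears to turn each dotted column name produced by pd.json_normalize (for example ACES.Header.Company) back into nested dictionaries. Here is a minimal standalone sketch of just that step. The helper name unflatten and the sample row are hypothetical and are not part of the program above.

```python
def unflatten(flat_row, sep="."):
    """Rebuild nested dicts from a flat mapping whose keys are dotted paths."""
    nested = {}
    for name, value in flat_row.items():
        keys = name.split(sep)
        parent = nested
        # Walk (and create) the intermediate levels for every key but the last.
        for key in keys[:-1]:
            parent = parent.setdefault(key, {})
        # The last key holds the actual value.
        parent[keys[-1]] = value
    return nested


# Hypothetical flattened row, shaped like one row of pd.json_normalize output.
sample_row = {
    "ACES.@version": "4.2",
    "ACES.Header.Company": "x",
    "ACES.Header.SenderName": "y",
    "filename": "a.xml",
}
nested_row = unflatten(sample_row)
print(nested_row)
```

The original loop does the same thing with enumerate and an index == count check; slicing off the last key is only a shorter way to express it.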
This is the error I get when I run my program:
  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 58, in <module>
    merge_aces(input_folder, output_folder)
  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 12, in merge_aces
    df = pd.concat(
         ^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\concat.py", line 368, in concat
    op = _Concatenator(
         ^^^^^^^^^^^^^^
  File "C:\Users\me\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\concat.py", line 422, in __init__
    objs = list(objs)
           ^^^^^^^^^^
  File "C:\Users\me\OneDrive - \Documents\Data Conversion\ACES & PIES conversion\Python Program\aces_merge.py", line 14, in <genexpr>
    pd.json_normalize(xmltodict.parse(Path(file).read_bytes())).assign(
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\pathlib.py", line 1050, in read_bytes
    with self.open(mode='rb') as f:
         ^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'ANH_CSB_ACES_full-022320230223.xml'
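Note that the traceback names a bare file name ('ANH_CSB_ACES_full-022320230223.xml') rather than a full path. That is consistent with how os.listdir behaves: it returns entry names only, so Path(file) is resolved against the current working directory instead of input_folder. A minimal sketch of the difference, using a temporary folder and a hypothetical sample.xml rather than the real files:

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway folder with one XML file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "sample.xml").write_text("<ACES/>", encoding="utf-8")

    # os.listdir yields bare names with no folder component.
    names = os.listdir(folder)

    # Joining each name onto the folder gives a path that can be opened
    # regardless of the current working directory.
    full_paths = [folder / name for name in names]
    contents = [path.read_bytes() for path in full_paths]

    # Path.glob yields full paths directly and can filter by extension.
    globbed = sorted(folder.glob("*.xml"))

print(names)
```

In the program above, this would presumably mean reading Path(input_folder) / file, or iterating over Path(input_folder).glob("*.xml"), instead of reading Path(file).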
Below is an example of one of the XML files, with sensitive information removed. Each file contains many App elements like the one shown, and every file has this Header element. I want to keep this structure in the output XML file.
<?xml version="1.0" encoding="utf-8"?>
<ACES version="4.2">
  <Header>
    <Company>x</Company>
    <SenderName>y</SenderName>
    <SenderPhone>z</SenderPhone>
    <TransferDate>a</TransferDate>
    <BrandAAIAID>b</BrandAAIAID>
    <DocumentTitle>c</DocumentTitle>
    <DocFormNumber>2.0</DocFormNumber>
    <EffectiveDate>2023-02-22</EffectiveDate>
    <SubmissionType>FULL</SubmissionType>
    <MapperCompany>d</MapperCompany>
    <MapperContact>e</MapperContact>
    <MapperPhone>f</MapperPhone>
    <MapperEmail>g</MapperEmail>
    <VcdbVersionDate>2023-01-26</VcdbVersionDate>
    <QdbVersionDate>2023-01-26</QdbVersionDate>
    <PcdbVersionDate>2023-01-26</PcdbVersionDate>
  </Header>
  <App action="A" id="1">
    <BaseVehicle id="5911"/>
    <BodyType id="5"/>
    <EngineBase id="560"/>
    <Note>WITHOUT AUTO LEVELING SYSTEM</Note>
    <Qty>1</Qty>
    <PartType id="7600"/>
    <Position id="104"/>
    <Part>701940</Part>
  </App>
</ACES>
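For comparison, here is a minimal sketch of the overall goal (one Header, every App kept once per Part value) using only the standard library's xml.etree.ElementTree instead of pandas and xmltodict. This is a swapped-in technique, not the approach in the code above. It assumes that "duplicate" means the same Part text, mirroring the drop_duplicates("Aces.App.Part") call, and that the Header of the first file is the one to keep. The function name merge_aces_files and the two tiny demo files are made up for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def merge_aces_files(input_folder, output_file):
    """Merge ACES files: keep the first Header and one App per Part value."""
    merged_root = None
    seen_parts = set()
    written = 0
    for xml_path in sorted(Path(input_folder).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        if merged_root is None:
            # Start the output from the first file's root tag, attributes, and Header.
            merged_root = ET.Element(root.tag, dict(root.attrib))
            header = root.find("Header")
            if header is not None:
                merged_root.append(header)
        for app in root.findall("App"):
            part = app.findtext("Part")
            if part is not None:
                if part in seen_parts:
                    continue  # this Part value was already written, so skip the App
                seen_parts.add(part)
            merged_root.append(app)
            written += 1
    if merged_root is None:
        raise FileNotFoundError(f"no .xml files found in {input_folder}")
    ET.ElementTree(merged_root).write(
        output_file, encoding="utf-8", xml_declaration=True
    )
    return written


# Demo on two tiny made-up files that share Part 701940.
template = '<ACES version="4.2"><Header><Company>x</Company></Header>{apps}</ACES>'
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "in"
    source.mkdir()
    (source / "a.xml").write_text(
        template.format(apps="<App><Part>701940</Part></App><App><Part>701941</Part></App>"),
        encoding="utf-8",
    )
    (source / "b.xml").write_text(
        template.format(apps="<App><Part>701940</Part></App><App><Part>701942</Part></App>"),
        encoding="utf-8",
    )
    target = Path(tmp) / "merged.xml"
    app_count = merge_aces_files(source, target)
    merged = ET.parse(target).getroot()
    merged_parts = [app.findtext("Part") for app in merged.findall("App")]
    header_count = len(merged.findall("Header"))
    merged_version = merged.get("version")

print(app_count, merged_parts)
```

The output file should live outside the input folder; otherwise a second run would pick up the merged file as an input.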