I have 3 XML files in which I need to compare against each other. I need to find the duplicates in each file by a part ID inside of these app tags. Then, the file that is the largest (i.e. has the most data) I need to merge those duplicates to that large file and append the unique data from the other 2 files into that XML file.
I am somewhat familiar with the ElementTree module, however I am pretty stumped as to how to make this easier for myself to find duplicate data across all 3 files and do the remaining actions.
Does anyone have any pointers on how to accomplish this or at least get to the first step? I know how to parse xml files and use a for loop to go through all the elements in the root, but I'm not sure how I would grab the duplicate part IDs and the unique data to get it all into the largest file.
My Attempts:
I thought maybe I could convert these files into pandas dataframes using the read_xml() function but when I use a to_xml() function it doesn't retain the data nicely in the tags/elements. I scratched this idea since I think it wasn't the right approach.
I tried googling functions in the ElementTree to see if there's a way to find duplicates and I didn't find any native way to do this
To find the largest file I found out that the OS module has
os.path.getsize() which I think i can use for my criteria for merging into the largest file
My attempted code below using pandas:
import xml.etree.ElementTree as ET
import pandas as pd
import os
input_folder = r"C:\Users\me\OneDrive\Documents\Data Conversion\ACES & PIES conversion\Sample Files"
def merge_files(input_folder):
xml_list = []
for file in os.listdir(input_folder):
df = pd.read_xml(f"{input_folder}/{file}")
xml_list.append(df)
pd.set_option('display.max_columns', None)
xml_merged = pd.concat(xml_list, ignore_index=True)
df2 = pd.DataFrame(xml_merged)
print(df2.info())
merge_files(input_folder)
I have the below results when I show the df2.info():
Output exceeds the size limit. Open the full output data in a text editor
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274460 entries, 0 to 274459
Data columns (total 48 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Company 3 non-null object
1 SenderName 3 non-null object
2 SenderPhone 3 non-null object
3 TransferDate 3 non-null object
4 BrandAAIAID 3 non-null object
5 DocumentTitle 3 non-null object
6 DocFormNumber 3 non-null float64
7 EffectiveDate 3 non-null object
8 SubmissionType 3 non-null object
9 MapperCompany 3 non-null object
10 MapperContact 3 non-null object
11 MapperPhone 3 non-null object
12 MapperEmail 3 non-null object
13 VcdbVersionDate 3 non-null object
14 QdbVersionDate 3 non-null object
15 PcdbVersionDate 3 non-null object
16 action 274454 non-null object
17 id 274454 non-null float64
18 BaseVehicle 0 non-null float64
19 EngineBase 0 non-null float64
...
47 FuelSystemDesign 0 non-null float64
dtypes: float64(30), object(18)
memory usage: 100.5+ MB
None
Example of contents of one of the XML files below. There are many app tags that repeat throughout the file.
<?xml version="1.0" encoding="utf-8"?>
<ACES version="4.2">
<Header>
<Company>x</Company>
<SenderName>y</SenderName>
<SenderPhone>z</SenderPhone>
<TransferDate>a</TransferDate>
<BrandAAIAID>b</BrandAAIAID>
<DocumentTitle>c</DocumentTitle>
<DocFormNumber>2.0</DocFormNumber>
<EffectiveDate>2023-02-22</EffectiveDate>
<SubmissionType>FULL</SubmissionType>
<MapperCompany>d</MapperCompany>
<MapperContact>e</MapperContact>
<MapperPhone>f</MapperPhone>
<MapperEmail>g</MapperEmail>
<VcdbVersionDate>2023-01-26</VcdbVersionDate>
<QdbVersionDate>2023-01-26</QdbVersionDate>
<PcdbVersionDate>2023-01-26</PcdbVersionDate>
</Header>
<App action="A" id="1">
<BaseVehicle id="5911"/>
<BodyType id="5"/>
<EngineBase id="560"/>
<Note>WITHOUT AUTO LEVELING SYSTEM</Note>
<Qty>1</Qty>
<PartType id="7600"/>
<Position id="104"/>
<Part>701940</Part>
</App>