
I have 3 XML files that I need to compare against each other. I need to find the duplicates in each file by a part ID inside of the App tags. Then I need to merge those duplicates into the largest file (i.e. the one with the most data) and append the unique data from the other 2 files into that XML file.

I am somewhat familiar with the ElementTree module, but I am pretty stumped as to how to find the duplicate data across all 3 files and then carry out the remaining steps.

Does anyone have any pointers on how to accomplish this, or at least get to the first step? I know how to parse XML files and use a for loop to go through all the elements in the root, but I'm not sure how I would grab the duplicate part IDs and the unique data to get it all into the largest file.
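
The furthest I've gotten is imagining something like the rough, untested sketch below for a single file, using a Counter over the Part values (the filename is just a placeholder, and it assumes every App has a Part child like in my sample further down), but I don't know how to extend this across all 3 files or do the merging:

import xml.etree.ElementTree as ET
from collections import Counter

# placeholder filename; assumes each <App> element has a <Part> child
tree = ET.parse("file1.xml")
root = tree.getroot()

# count how many times each part ID appears in this one file
part_counts = Counter(app.findtext("Part") for app in root.iter("App"))
duplicates = [part for part, count in part_counts.items() if count > 1]
print(duplicates)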

My Attempts:

  • I thought maybe I could convert these files into pandas DataFrames using the read_xml() function, but when I use the to_xml() function it doesn't retain the data nicely in the tags/elements. I scrapped this idea since I don't think it was the right approach.

  • I tried googling ElementTree functions to see if there's a way to find duplicates, and I didn't find any native way to do this.

  • To find the largest file, I found out that the os module has
    os.path.getsize(), which I think I can use as my criterion for choosing the largest file to merge into (rough sketch after this list).
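
Here's the rough, untested sketch I have in mind for that (the folder path is just a placeholder for where my 3 files live):

import os

# placeholder for the folder that holds the 3 XML files
input_folder = r"C:\path\to\xml\files"

xml_paths = [
    os.path.join(input_folder, name)
    for name in os.listdir(input_folder)
    if name.lower().endswith(".xml")
]

# the file with the most bytes is assumed to be the one with the most data
largest_file = max(xml_paths, key=os.path.getsize)
print(largest_file)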

My attempted code below using pandas:

import xml.etree.ElementTree as ET
import pandas as pd
import os

input_folder = r"C:\Users\me\OneDrive\Documents\Data Conversion\ACES & PIES conversion\Sample Files"


def merge_files(input_folder):
    xml_list = []
    for file in os.listdir(input_folder):
        df = pd.read_xml(f"{input_folder}/{file}")
        xml_list.append(df)
    
    pd.set_option('display.max_columns', None)
    xml_merged = pd.concat(xml_list, ignore_index=True)
    df2 = pd.DataFrame(xml_merged)
    print(df2.info())


merge_files(input_folder)

I get the results below when I print df2.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274460 entries, 0 to 274459
Data columns (total 48 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Company                  3 non-null       object 
 1   SenderName               3 non-null       object 
 2   SenderPhone              3 non-null       object 
 3   TransferDate             3 non-null       object 
 4   BrandAAIAID              3 non-null       object 
 5   DocumentTitle            3 non-null       object 
 6   DocFormNumber            3 non-null       float64
 7   EffectiveDate            3 non-null       object 
 8   SubmissionType           3 non-null       object 
 9   MapperCompany            3 non-null       object 
 10  MapperContact            3 non-null       object 
 11  MapperPhone              3 non-null       object 
 12  MapperEmail              3 non-null       object 
 13  VcdbVersionDate          3 non-null       object 
 14  QdbVersionDate           3 non-null       object 
 15  PcdbVersionDate          3 non-null       object 
 16  action                   274454 non-null  object 
 17  id                       274454 non-null  float64
 18  BaseVehicle              0 non-null       float64
 19  EngineBase               0 non-null       float64
...
 47  FuelSystemDesign         0 non-null       float64
dtypes: float64(30), object(18)
memory usage: 100.5+ MB
None

Below is an example of the contents of one of the XML files. There are many App tags that repeat throughout the file.

<?xml version="1.0" encoding="utf-8"?>
<ACES version="4.2">
     <Header>
          <Company>x</Company>
          <SenderName>y</SenderName>
          <SenderPhone>z</SenderPhone>
          <TransferDate>a</TransferDate>
          <BrandAAIAID>b</BrandAAIAID>
          <DocumentTitle>c</DocumentTitle>
          <DocFormNumber>2.0</DocFormNumber>
          <EffectiveDate>2023-02-22</EffectiveDate>
          <SubmissionType>FULL</SubmissionType>
          <MapperCompany>d</MapperCompany>
          <MapperContact>e</MapperContact>
          <MapperPhone>f</MapperPhone>
          <MapperEmail>g</MapperEmail>
          <VcdbVersionDate>2023-01-26</VcdbVersionDate>
          <QdbVersionDate>2023-01-26</QdbVersionDate>
          <PcdbVersionDate>2023-01-26</PcdbVersionDate>
     </Header>
     <App action="A" id="1">
          <BaseVehicle id="5911"/>
          <BodyType id="5"/>
          <EngineBase id="560"/>
          <Note>WITHOUT AUTO LEVELING SYSTEM</Note>
          <Qty>1</Qty>
          <PartType id="7600"/> 
          <Position id="104"/>
          <Part>701940</Part>
      </App>
  • can you provide examples of the xml files? – Alexander Mar 03 '23 at 20:44
  • *but when I use a to_xml() function it doesn't retain the data nicely in the tags/elements* ... since `to_xml()` maintains the stylesheet argument to run XSLT, you can nicely render *any* tag/elements you need from default output. Please post sample XML for [mre]. – Parfait Mar 04 '23 at 14:53
  • Thank you for your replies. I have edited my original post to include sample XML with sensitive info removed from the text elements. Please let me know your thoughts as to what I need to do in order to make to_xml() from pandas render the data appropriately. – Kevin Agbulos Mar 06 '23 at 13:59

1 Answer


This approach makes a dictionary from the main (presumably the biggest) XML file, then creates another dictionary from a second file containing only the App elements whose Part values are NOT present in the first dictionary. Those elements are then appended to the first file's root element and written back to the first file.

from lxml import etree

# parse first file
doc = etree.parse('tmp.xml')
a = doc.xpath('/ACES/App')
# make a dict with the Part value as keys, e.g.
# {'701941': <Element App at 0x7f3f72e7f648>}
d = {app.xpath('Part/text()')[0]: app for app in a}

# parse second file
doc2 = etree.parse('tmp2.xml')
a2 = doc2.xpath('/ACES/App')
# make a dict with keys NOT present in the first dict
d2 = {app.xpath('Part/text()')[0]: app for app in a2 if app.xpath('Part/text()')[0] not in d}

root = doc.getroot()

for k, v in d2.items():
    root.append(v)

print(etree.tostring(root, pretty_print=True).decode('utf8'))

et = etree.ElementTree(root)
et.write('tmp.xml', pretty_print=True)
  • Thanks for your suggestion. However, I have multiple of these XML files. Right now I actually have 3 XML files I need to merge and drop duplicate parts. So I would need to create 3 dictionaries to accomplish this? – Kevin Agbulos Mar 16 '23 at 14:25
  • @KevinAgbulos yes, one dictionary per file. The answer shows the concept for 2 files but can easily be extended to more. – LMC Mar 16 '23 at 14:57
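
A rough sketch of that extension to 3 files, reusing the answer's Part-keyed dictionary idea (the filenames are placeholders, and it assumes the first file in the list is the largest one that the others get merged into):

from lxml import etree

# placeholder filenames; the first is assumed to be the largest file,
# which is the one the unique Apps get merged into
files = ['tmp.xml', 'tmp2.xml', 'tmp3.xml']

main_doc = etree.parse(files[0])
main_root = main_doc.getroot()
# Part value -> App element for everything already in the main file
seen = {app.xpath('Part/text()')[0]: app for app in main_doc.xpath('/ACES/App')}

for path in files[1:]:
    doc = etree.parse(path)
    for app in doc.xpath('/ACES/App'):
        part = app.xpath('Part/text()')[0]
        # only append Apps whose Part value has not been seen yet
        if part not in seen:
            seen[part] = app
            main_root.append(app)

etree.ElementTree(main_root).write(files[0], pretty_print=True)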