I have an XML file to parse and load into a data frame. The XML has duplicate tags, so

xmldataframe <- xmlToDataFrame("C:\\Sample.XML")

does not work and throws this error: Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("C", : duplicate subscripts for columns

When I remove the duplicate tags manually, it works. The problem is that the real XML is huge and arrives in real time, and I can't correct it by hand because I can't find all the duplicate tags.

  1. Is there a way to find the duplicate tags so I can remove them manually?
  2. If there are duplicates, can they be combined into the same column of the data frame?

Here is the sample XML.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<IesEnhancedAttributes>
    <EnhancedAttribute>
        <action>C</action>
        <cleiCode>SDDFDFDFD</cleiCode>
        <physicalDescription>Small Form Factor(SFF), (e.g., SFP, GBIC, XFP, XPAK)</physicalDescription>
        <height_metric unit="mm">8.6</height_metric>
        <height_english unit="in">0.339</height_english>
        <width_metric unit="mm">13.7</width_metric>
        <width_english unit="in">0.539</width_english>
        <depth_metric unit="mm">56.5</depth_metric>
        <depth_english unit="in">2.224</depth_english>
            <depth_english unit="in">3.333</depth_english>
        <weight_metric unit="NS"></weight_metric>
        <weight_english unit="NS"></weight_english>
        <MaximumPowerUsage unit="NA"></MaximumPowerUsage>
        <operatingTemperature_metric_min unit="NS"></operatingTemperature_metric_min>
        <operatingTemperature_metric_max unit="NS"></operatingTemperature_metric_max>
        <operatingTemperature_english_min unit="NS"></operatingTemperature_english_min>
        <operatingTemperature_english_max unit="NS"></operatingTemperature_english_max>
        <storageTemperature_metric_min unit="NS"></storageTemperature_metric_min>
        <storageTemperature_metric_max unit="NS"></storageTemperature_metric_max>
        <storageTemperature_english_min unit="NS"></storageTemperature_english_min>
        <storageTemperature_english_max unit="NS"></storageTemperature_english_max>
        <humidity_min unit="NS">0</humidity_min>
        <humidity_max unit="NS">0</humidity_max>
        <altitude_metric_min unit="NS"></altitude_metric_min>
        <altitude_metric_max unit="NS"></altitude_metric_max>
        <altitude_english_min unit="NS"></altitude_english_min>
        <altitude_english_max unit="NS"></altitude_english_max>
        <alarmCapable>Y</alarmCapable>
        <PCNChange></PCNChange>
        <orderingCode>81.SOC12IR1131S</orderingCode>
        <maximumHeatDissipation_metric unit="NS"></maximumHeatDissipation_metric>
        <maximumHeatDissipation_english unit="NS"></maximumHeatDissipation_english>
        <frameSpacing_metric unit="NA"></frameSpacing_metric>
        <frameSpacing_english unit="NA"></frameSpacing_english>
    </EnhancedAttribute>
    <EnhancedAttribute>
        <action>C</action>
        <cleiCode>FDFDFDFDFDF</cleiCode>
        <physicalDescription>Small Form Factor(SFF), (e.g., SFP, GBIC, XFP, XPAK)</physicalDescription>
        <height_metric unit="mm">8.6</height_metric>
        <height_english unit="in">0.339</height_english>
        <width_metric unit="mm">13.7</width_metric>
        <width_english unit="in">0.539</width_english>
        <depth_metric unit="mm">56.5</depth_metric>
        <depth_english unit="in">2.224</depth_english>
        <weight_metric unit="NS"></weight_metric>
        <weight_english unit="NS"></weight_english>
        <MaximumPowerUsage unit="NA"></MaximumPowerUsage>
        <operatingTemperature_metric_min unit="NS"></operatingTemperature_metric_min>
        <operatingTemperature_metric_max unit="NS"></operatingTemperature_metric_max>
        <operatingTemperature_english_min unit="NS"></operatingTemperature_english_min>
        <operatingTemperature_english_max unit="NS"></operatingTemperature_english_max>
        <storageTemperature_metric_min unit="NS"></storageTemperature_metric_min>
        <storageTemperature_metric_max unit="NS"></storageTemperature_metric_max>
        <storageTemperature_english_min unit="NS"></storageTemperature_english_min>
        <storageTemperature_english_max unit="NS"></storageTemperature_english_max>
        <humidity_min unit="NS">0</humidity_min>
        <humidity_max unit="NS">0</humidity_max>
            <humidity_max unit="NS">1</humidity_max>
        <altitude_metric_min unit="NS"></altitude_metric_min>
        <altitude_metric_max unit="NS"></altitude_metric_max>
        <altitude_english_min unit="NS"></altitude_english_min>
        <altitude_english_max unit="NS"></altitude_english_max>
        <alarmCapable>Y</alarmCapable>
        <PCNChange></PCNChange>
        <HazardousMaterialIndicator>6</HazardousMaterialIndicator>
        <orderingCode>81.SOC12IR1131S</orderingCode>
        <frameSpacing_metric unit="NA"></frameSpacing_metric>
        <frameSpacing_english unit="NA"></frameSpacing_english>
    </EnhancedAttribute>
</IesEnhancedAttributes>
Joe Vijay
  • You certainly can identify duplicate elements via XSLT, but if your input XML is huge then *manually* removing duplicates is presumably not a viable option. Luckily, XSLT can help with that, too, but (1) it's not clear to me how, exactly, you want to handle such duplicates, and (2) we are not a code-writing service. – John Bollinger Jun 27 '17 at 20:09
  • Thanks for the XSLT option, I'm going to explore that further. 1. For duplicates, I just need one out of many. 2. Sorry, I didn't mean to ask for a piece of code. – Joe Vijay Jun 28 '17 at 21:13

1 Answer

Consider Muenchian grouping in XSLT to remove the duplicate elements, keeping the first occurrence of each element name within every EnhancedAttribute, and then have R read the transformed output. Base R has no built-in support for this special-purpose language, so one option is to call an external XSLT processor with system(); scripts in other languages such as PHP, Python, or Java that can run XSLT 1.0 would work too. Below are examples for R on Unix (Linux/Mac) and Windows:

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <xsl:key name="elemid" match="EnhancedAttribute/*" 
             use="concat(count(../preceding-sibling::*) + 1, name())"/>

    <xsl:template match="/IesEnhancedAttributes">  
        <xsl:copy> 
            <xsl:apply-templates select="EnhancedAttribute"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="EnhancedAttribute"> 
        <xsl:copy>     
            <xsl:copy-of select="*[generate-id() = generate-id(key('elemid', 
                  concat(count(../preceding-sibling::*) + 1, name()))[1])]"/> 
        </xsl:copy>       
    </xsl:template>

</xsl:stylesheet>
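
If installing extra packages is an option, the transformation might also be run inside R instead of through system(). The sketch below assumes the xml2 and xslt packages from CRAN are installed and that R is version 4.0 or later (for the raw string literal). The small inline input is a made-up stand-in for the real file, and the stylesheet is the same one as above.

```r
library(xml2)
library(xslt)

# Hypothetical stand-in input: one record with a duplicated child element
input <- read_xml('<IesEnhancedAttributes>
  <EnhancedAttribute>
    <action>C</action>
    <depth_english unit="in">2.224</depth_english>
    <depth_english unit="in">3.333</depth_english>
  </EnhancedAttribute>
</IesEnhancedAttributes>')

# Same stylesheet as above, held in a raw string (R >= 4.0) so that the
# single and double quotes inside it need no escaping
style <- read_xml(r"---(
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>

    <xsl:key name="elemid" match="EnhancedAttribute/*"
             use="concat(count(../preceding-sibling::*) + 1, name())"/>

    <xsl:template match="/IesEnhancedAttributes">
        <xsl:copy>
            <xsl:apply-templates select="EnhancedAttribute"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="EnhancedAttribute">
        <xsl:copy>
            <xsl:copy-of select="*[generate-id() = generate-id(key('elemid',
                  concat(count(../preceding-sibling::*) + 1, name()))[1])]"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
)---")

# Apply the stylesheet in-process; no external processor is called
output <- xml_xslt(input, style)

# Only the first depth_english element should remain
remaining <- xml_find_all(output, "//depth_english")
print(xml_text(remaining))
```

For a real file, read_xml() can take the file path instead of an inline string, and the result could be saved with write_xml() and then read with xmlParse() as in the examples below.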

R for Unix using xsltproc

library(XML)

setwd('/path/to/working/directory')
system(paste0('cd ', getwd(), ' && xsltproc -o Output.xml XSLTScript.xsl Input.xml'))

doc <- xmlParse('Output.xml')
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//EnhancedAttribute"))

R for Windows using PowerShell script

library(XML)

system(paste0('Powershell.exe -File',
              ' "C:\\Path\\To\\PowerShell\\Script.ps1"',
              ' "C:\\Path\\To\\Input.xml"',
              ' "C:\\Path\\To\\XSLT\\Script.xsl"', 
              ' "C:\\Path\\To\\Output.xml"'))

# Read the file from the same location the PowerShell script wrote it to
doc <- xmlParse('C:\\Path\\To\\Output.xml')
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//EnhancedAttribute"))
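
If running XSLT is not convenient, a rough alternative is to handle the duplicates directly in R with the same XML package. The sketch below is only an illustration on a shortened, made-up version of the sample: it lists the tag names that repeat inside each EnhancedAttribute, then builds the data frame by hand, keeping the first occurrence of each tag (matching the "1 out of many" requirement from the comments). The names dup_tags, rows, and all_cols are my own.

```r
library(XML)

# Shortened stand-in for the sample XML, with one duplicate per record
xml_text <- '<IesEnhancedAttributes>
  <EnhancedAttribute>
    <action>C</action>
    <cleiCode>SDDFDFDFD</cleiCode>
    <depth_english unit="in">2.224</depth_english>
    <depth_english unit="in">3.333</depth_english>
  </EnhancedAttribute>
  <EnhancedAttribute>
    <action>C</action>
    <cleiCode>FDFDFDFDFDF</cleiCode>
    <humidity_max unit="NS">0</humidity_max>
    <humidity_max unit="NS">1</humidity_max>
    <HazardousMaterialIndicator>6</HazardousMaterialIndicator>
  </EnhancedAttribute>
</IesEnhancedAttributes>'

doc <- xmlParse(xml_text, asText = TRUE)
records <- getNodeSet(doc, "//EnhancedAttribute")

# Names of the element children of one record, in document order
child_tags <- function(rec) {
  vapply(getNodeSet(rec, "./*"), xmlName, character(1))
}

# 1. Tag names that occur more than once within each record
dup_tags <- lapply(records, function(rec) {
  tags <- child_tags(rec)
  unique(tags[duplicated(tags)])
})
print(dup_tags)

# 2. One-row data frame per record, keeping the first of each duplicated tag
rows <- lapply(records, function(rec) {
  children <- getNodeSet(rec, "./*")
  tags <- child_tags(rec)
  # paste() guards against an empty element yielding a zero-length value
  values <- vapply(children, function(n) paste(xmlValue(n), collapse = ""),
                   character(1))
  keep <- !duplicated(tags)
  as.data.frame(as.list(setNames(values[keep], tags[keep])),
                stringsAsFactors = FALSE)
})

# Records can carry different tags, so fill missing columns with NA first
all_cols <- unique(unlist(lapply(rows, names)))
rows <- lapply(rows, function(r) {
  for (col in setdiff(all_cols, names(r))) r[[col]] <- NA_character_
  r[all_cols]
})
df <- do.call(rbind, rows)
print(df)
```

To combine duplicates into one column instead of dropping them, the keep step could be swapped for something like tapply(values, tags, paste, collapse = ";"), at the cost of losing the original column order.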
Parfait