Efficiently removing duplicate XML elements in C#

Question

I have a couple of XML files that contain lots of duplicate entries, such as these:

<annotations>
  <annotation value=",Clear,Outdoors" eventID="2">
    <image location="Location 1" />
    <image location="Location 2" />
    <image location="Location 2" />
  </annotation>

  <annotation value=",Not a problem,Gravel,Shopping" eventID="2">
    <image location="Location 3" />
    <image location="Location 4" />
    <image location="Location 5" />
    <image location="Location 5" />
    <image location="Location 5" />
  </annotation>
</annotations>

I want to remove the duplicate elements within each child annotation. My approach was to copy all the elements to a list and then compare them:

foreach (var el in xdoc.Descendants("annotation").ToList())
{
    foreach (var x in el.Elements("image").Attributes("location").ToList())
    {
        // add elements to a list
    }
}

Halfway through, I realized this is very inefficient and time-consuming. I'm fairly new to XML, and I was wondering whether there are any built-in methods in C# that I can use to remove duplicates.

I tried using

if(!x.value.Distinct()) // can't convert collections to bool
    x.Remove();

But that doesn't work, neither does

if(x.value.count() > 1) // value.count returns the number of elements.
   x.Remove()

score 6 · Accepted Answer · answered Sep 12 '14 at 16:36

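using System.Linq;
using System.Xml.Linq;

XDocument xDoc = XDocument.Parse(xmlString);

// Within each annotation, group the images by location and remove every element after the first in each group.
xDoc.Root.Elements("annotation")
         .SelectMany(s => s.Elements("image")
                           .GroupBy(g => g.Attribute("location").Value)
                           .SelectMany(m => m.Skip(1))).Remove();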
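
To see the effect on the sample from the question, here is a minimal sketch that runs the same kind of query against a shortened inline copy of that XML. The `xml` string, the lambda parameter names, and the `remaining` variable are illustrative; the top-level statements assume C# 9 or later, and the `(string)` cast is swapped in for `.Value` so that an `image` without a `location` attribute would not throw.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Shortened inline copy of the sample document from the question (illustrative).
string xml = @"<annotations>
  <annotation value="",Clear,Outdoors"" eventID=""2"">
    <image location=""Location 1"" />
    <image location=""Location 2"" />
    <image location=""Location 2"" />
  </annotation>
  <annotation value="",Not a problem,Gravel,Shopping"" eventID=""2"">
    <image location=""Location 3"" />
    <image location=""Location 5"" />
    <image location=""Location 5"" />
    <image location=""Location 5"" />
  </annotation>
</annotations>";

XDocument xDoc = XDocument.Parse(xml);

// Per annotation: group images by location, keep the first of each group, remove the rest.
xDoc.Root.Elements("annotation")
    .SelectMany(a => a.Elements("image")
                      .GroupBy(i => (string)i.Attribute("location"))
                      .SelectMany(g => g.Skip(1)))
    .Remove();

// Two images survive in each annotation.
int remaining = xDoc.Descendants("image").Count();
Console.WriteLine(remaining); // prints 4
Console.WriteLine(xDoc);
```

For a file on disk, `XDocument.Load(path)` would take the place of `XDocument.Parse`, and `xDoc.Save(path)` would write the cleaned document back.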

answered Sep 12 '14 at 16:36

Tony Stark


Heh, this is why I don't use linq, I find that really hard to follow. That's a criticism of Linq, not the answer. – Flynn1179 Sep 12 '14 at 16:39
Doesn't `XDocument.Parse()` take a string? Or does it work if I pass in the path to my document? – cyberbemon Sep 12 '14 at 16:47
To pass a path to an XML document, use `XDocument.Load`. – Tony Stark Sep 12 '14 at 17:34
LINQ is shorthand for function calls. The name on the left of the `=>` is the parameter of the function, which is generally applied to a sequence of items. So each g, or m, or whatever you want to call it, is an item in the sequence. – Chuck Savage Sep 13 '14 at 02:10
Upvoted for the idea, but I think it's simpler as Where, GroupBy, SelectMany, Remove. – dudeNumber4 Jan 29 '18 at 13:14

score 0 · Answer 2 · answered Sep 12 '14 at 16:25

If your duplicates are always in this form, then you could do this with a bit of XSLT to remove duplicate nodes. The XSLT for this is:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="image[@location = preceding-sibling::image/@location]"/>
</xsl:stylesheet>

If this is something you need to do frequently, it might be worth loading that stylesheet into an XslCompiledTransform instance once and reusing it.
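
As a rough sketch of what that could look like, the following loads the stylesheet into an `XslCompiledTransform` and applies it to an in-memory document. The `stylesheet`, `input`, and `result` names and the inline sample are assumptions for illustration; in practice the stylesheet and the XML would more likely come from files. The top-level statements assume C# 9 or later.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// The stylesheet from this answer, inlined for the sake of a self-contained example.
string stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""node()|@*"">
    <xsl:copy>
      <xsl:apply-templates select=""node()|@*""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""image[@location = preceding-sibling::image/@location]""/>
</xsl:stylesheet>";

// Illustrative input with one duplicated location.
string input = @"<annotations>
  <annotation value="",Clear,Outdoors"" eventID=""2"">
    <image location=""Location 1"" />
    <image location=""Location 2"" />
    <image location=""Location 2"" />
  </annotation>
</annotations>";

// Compile the stylesheet once; the same instance can then transform many documents.
var transform = new XslCompiledTransform();
using (var xsltReader = XmlReader.Create(new StringReader(stylesheet)))
{
    transform.Load(xsltReader);
}

var output = new StringWriter();
using (var inputReader = XmlReader.Create(new StringReader(input)))
using (var writer = XmlWriter.Create(output, transform.OutputSettings))
{
    transform.Transform(inputReader, writer);
}

string result = output.ToString();
Console.WriteLine(result);
```

Because the identity template also copies whitespace text nodes, the output may contain blank lines where the duplicate elements used to be.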

Or you can simply get a list of all duplicate nodes using this XPath:

/annotations/annotation/image[@location = preceding-sibling::image/@location]

and remove them from their parent.
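
One way to do this from C#, sketched here with LINQ to XML, is the `XPathSelectElements` extension method from `System.Xml.XPath`. The `doc` and `duplicates` names and the inline sample are illustrative, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Illustrative input: "Location 5" appears three times in the same annotation.
XDocument doc = XDocument.Parse(@"<annotations>
  <annotation value="",Not a problem,Gravel,Shopping"" eventID=""2"">
    <image location=""Location 3"" />
    <image location=""Location 4"" />
    <image location=""Location 5"" />
    <image location=""Location 5"" />
    <image location=""Location 5"" />
  </annotation>
</annotations>");

// Take a snapshot with ToList() so the tree is not modified while the query is still being evaluated.
var duplicates = doc.XPathSelectElements(
    "/annotations/annotation/image[@location = preceding-sibling::image/@location]").ToList();

foreach (XElement duplicate in duplicates)
{
    duplicate.Remove(); // detaches the element from its parent annotation
}

Console.WriteLine(doc);
```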

score 0 · Answer 3 · edited May 23 '17 at 12:19

There are a couple of things that you could do here. In addition to the other answers so far, note that Distinct() has an overload that takes an IEqualityComparer. You could use something like this ProjectionEqualityComparer to do the following:

var images = xdoc.Descendants("image")
    .Distinct(ProjectionEqualityComparer<XElement>.Create(xe => xe.Attributes("location").First().Value));

... which would give you the "image" elements with distinct location attributes.
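
Note that `Distinct()` only returns a filtered sequence; it does not change the document, and applied to `xdoc.Descendants("image")` it compares locations across the whole file rather than within one annotation. If the goal is to remove the duplicates in place, a different technique that needs no external comparer is to track the locations already seen in a `HashSet<string>`. This is a minimal sketch under the assumption that duplicates should be judged per annotation; the `seen` set and the inline sample are illustrative, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Illustrative input with one duplicated location.
XDocument xdoc = XDocument.Parse(@"<annotations>
  <annotation value="",Clear,Outdoors"" eventID=""2"">
    <image location=""Location 1"" />
    <image location=""Location 2"" />
    <image location=""Location 2"" />
  </annotation>
</annotations>");

// ToList() snapshots the annotations so the tree is not modified during enumeration.
foreach (XElement annotation in xdoc.Descendants("annotation").ToList())
{
    var seen = new HashSet<string>();

    // HashSet.Add returns false when the value is already present,
    // so the filter keeps exactly the elements that repeat an earlier location.
    annotation.Elements("image")
              .Where(img => !seen.Add((string)img.Attribute("location")))
              .Remove();
}

Console.WriteLine(xdoc);
```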
