6

I am parsing XML files from a third-party provider. Unfortunately, the XML is not always well-formed: some elements contain duplicate attributes.

I don't have control over the source and I don't know which elements may have duplicate attributes nor do I know the duplicate attribute names in advance.

Obviously, loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I get to the offending element.

However, the XmlException is raised on reader.Read(), before I get a chance to inspect the element's attributes.

Here's a sample method to demonstrate the issue:

public static void ParseTest()
{
    const string xmlString = 
        @"<?xml version='1.0'?>
        <!-- This is a sample XML document -->
        <Items dupattr=""10"" id=""20"" dupattr=""33"">
            <Item>test with a child element <more/> stuff</Item>
        </Items>";

    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read())   /* Exception is thrown here when the Items element is encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes){ /* CopyNonDuplicateAttributes(); */}
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }

        }
    }
    string str = output.ToString();
}

Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?

Catch22
  • It is only possible if the XML processor API provides hooks that allow you to hook into the processing and handle the error conditions – Ankur Jul 07 '11 at 11:24
  • Interesting problem, look forward to seeing the solution! – Kieren Johnstone Jul 07 '11 at 11:58
  • There will be no solution to this problem using XML, because your input is not XML. You say you have no control over the input, but can you at least make your superiors aware that your vendor is not sending you XML? Can you at least make sure that your _vendor_ knows this? Any organization stupid enough to send this data might be stupid enough to not realize that it's not XML. – John Saunders Jul 07 '11 at 15:10
  • Makes sense I guess. So I've reverted to treating the content as a string which I clean up before parsing as XML. Turns out there was only one offending attribute right at the start of the XML from the vendor and 10,000 lines later, everything else is clean. – Catch22 Jul 12 '11 at 08:57
  • I think your question has already been answered here: http://stackoverflow.com/questions/4085065/xml-linq-removing-duplicate-nodes-in-xelement-c – saj Jul 07 '11 at 14:14
  • -1: no, that question is about removing elements which are duplicates of each other based on attribute values. This question is about "elements" which have multiple copies of the same attribute, like ``, which is not XML. – John Saunders Jul 07 '11 at 15:11

2 Answers

4

I found a solution by treating the XML as an HTML document. Using the open-source Html Agility Pack library, I was able to get valid XML.

The trick was to save the XML with an HTML header first.
Replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML DOCTYPE declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Once the contents are saved to a file, this method will return a valid XML document.

// Requires a reference to HtmlAgilityPack
using System.IO;
using System.Xml;
using HtmlAgilityPack;

public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();

    var m = new MemoryStream();
    var xtw = new XmlTextWriter(m, null);

    // Load the content into the writer
    web.LoadHtmlAsXml(url, xtw);

    // Rewind the memory stream
    m.Position = 0;

    // Create, fill, and return the xml document
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml((new StreamReader(m)).ReadToEnd());
    return xmlDoc;
}

The duplicate attribute nodes are automatically removed with the later attribute values overwriting the earlier ones.
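
If the content is already in memory, a variant that skips the file and HtmlWeb might look like the sketch below. It assumes that HtmlDocument.LoadHtml and HtmlDocument.Save(XmlWriter) from Html Agility Pack treat duplicate attributes the same way as the HtmlWeb.LoadHtmlAsXml call above. The names LenientXmlLoader and LoadFromString are made up for illustration, and the sketch has not been checked against the vendor's data. The DOCTYPE swap described above would presumably still need to be applied to the string first.

```csharp
using System.IO;
using System.Xml;
using HtmlAgilityPack; // third-party: Html Agility Pack

public static class LenientXmlLoader // hypothetical helper name
{
    public static XmlDocument LoadFromString(string markup)
    {
        // Parse leniently as HTML instead of as strict XML.
        var html = new HtmlDocument();
        html.LoadHtml(markup);

        using (var text = new StringWriter())
        {
            // Same writer type as in the method above, but targeting a string.
            var xtw = new XmlTextWriter(text);
            html.Save(xtw);
            xtw.Flush();

            // Re-parse the cleaned-up output as real XML.
            var xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(text.ToString());
            return xmlDoc;
        }
    }
}
```

Because the input is parsed as HTML, element names that have special meaning in HTML, and the casing of element names, may not survive the round trip unchanged, so it is worth comparing the output with the input on a small sample.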

Catch22
0

OK, I think you need to catch the error.

Then you should be able to use the following methods:

reader.MoveToFirstAttribute();

and

reader.MoveToNextAttribute()

to get the following properties:

reader.Value
reader.Name

This will enable you to get all the attribute values.
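
As a minimal sketch of the catching step, the helper below reads until the first well-formedness error and reports where it occurred, using the LineNumber and LinePosition properties of XmlException. The names DuplicateAttributeProbe and FindFirstError are hypothetical. It only locates the problem; as the comments below note, the reader cannot carry on with the rest of the document afterwards.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DuplicateAttributeProbe // hypothetical helper name
{
    // Returns the message of the first XmlException, or null if the input
    // parses cleanly. line and column receive the position of the error.
    public static string FindFirstError(string xml, out int line, out int column)
    {
        line = 0;
        column = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            try
            {
                while (reader.Read()) { /* just walk the document */ }
                return null;
            }
            catch (XmlException ex)
            {
                line = ex.LineNumber;
                column = ex.LinePosition;
                return ex.Message;
            }
        }
    }
}
```

The reported position could then be used to find the offending start tag in the raw text and clean it up before parsing again.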

openshac
  • I can catch the error and process the attributes on the current node (i.e. copy non duplicates) but the problem then is continuing with processing the rest of the document as `reader.Read()` returns false so no more elements get processed. – Catch22 Jul 07 '11 at 13:02
  • @Catch22, yep, I did come across that whilst trying to get the code to resume. I hoped you'd find a way around it. Have a look here: http://bytes.com/topic/c-sharp/answers/827965-how-handle-xml-parsing-exception. It looks like XmlReader is error-intolerant for a reason. This would normally be good news, but in your case it means my suggested solution probably won't work. Sorry. – openshac Jul 07 '11 at 15:51