The system I'm working on uses DataSet.ReadXml(XmlReader) to read an XML file and load its contents to a DataSet. The XML file is from a business partner and may not always be well-formed, but this system is expected to perform reasonable corrections to the input.

We've seen errors in the XML input files, such as:

  • Case 1: In the middle of a string value, use of characters such as '<', '>', or my favorite, '&', which causes "An error occurred while parsing EntityName. Line x, position y."
  • Case 2: In the middle of a string value, weird constructs such as "<3" so that the text depicts a heart, which causes "Name cannot begin with the '3' character. Line x, position y."
  • Case 3: Invalid characters for the given encoding, which causes "Invalid character in the given encoding. Line x, position y."

If some simple rules are adopted, these errors can be addressed programmatically:

  • Case 1: Replace the offending character with its XML character entity ("&" becomes "&amp;", etc.)
  • Case 2: Replace the "<" in "<3" with a space, so that it becomes " 3"
  • Case 3: Replace the invalid character with a space
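
As a rough illustration of the Case 1 and Case 2 rules, here is a minimal sketch in Python that applies them to the text before any XML parser sees it. The function name `repair_markup` and the three regular expressions are my own assumptions, not part of any library. A bare '>' is left alone because XML permits it in character data. The heuristic cannot tell a stray '<' followed by a letter from the start of a real tag, so it only catches '<' followed by a digit, whitespace, or punctuation.

```python
import re

# Case 1: a '&' that does not begin a well-formed entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# Case 2: a '<' immediately followed by a digit, as in "<3".
LT_BEFORE_DIGIT = re.compile(r"<(?=\d)")
# Case 1: any other '<' that cannot start a tag, comment, CDATA section, or PI.
STRAY_LT = re.compile(r"<(?![A-Za-z_:/!?])")


def repair_markup(text: str) -> str:
    """Apply the Case 1 and Case 2 repair rules to raw XML text (hypothetical helper)."""
    text = BARE_AMP.sub("&amp;", text)      # "&"  -> "&amp;"
    text = LT_BEFORE_DIGIT.sub(" ", text)   # "<3" -> " 3"
    text = STRAY_LT.sub("&lt;", text)       # "1 < 2" -> "1 &lt; 2"
    return text


broken = "<note><name>AT&T</name><msg>I <3 XML &amp; 1 < 2</msg></note>"
print(repair_markup(broken))
```

Entities that are already correct, such as "&amp;", pass through unchanged, so running the repair twice should not double-escape anything.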

However, all of these errors raise the same exception: System.Xml.XmlException

I would like to take an appropriate action when any of these errors is encountered, but what's the best way to do that? These three different errors all have the same HRESULT value (-2146232000), and so far the only way I have been able to differentiate amongst them is by inspecting the XmlException.Message string property.

String comparison seems a lousy way to determine the exact cause of the error. Were I to follow that approach, the code would break should the exception message change in future versions of .NET. It would also not be portable to some languages.

Therefore, how does one programmatically differentiate between the various types of errors that could be represented in an XmlException?

EDIT

In the comments below I've received advice about the importance of ensuring that XML data is of high quality. I don't disagree, but as my question states, it's outside my control and I can do nothing about it. So, as well-intentioned as your remarks are, they miss the point. If you know a good way to differentiate amongst the very many errors that can be presented by the System.Xml.XmlException class, please, share your knowledge. Thank you.

STLDev
  • If you haven't agreed with your partners on an XML format and on rules for the content of its elements, you will face this problem all the time. It sounds like you are always waiting to see what the next error will be. – Vijunav Vastivch Aug 11 '17 at 01:32
  • @reds, I agree with your statement. Though I've been successful in getting many things addressed, the reality of the situation is that it is what it is. – STLDev Aug 11 '17 at 01:39
  • I think you should do negative testing on your project, in the hope that all possible errors are caught and addressed. – Vijunav Vastivch Aug 11 '17 at 01:45
  • Great...but this doesn't help answer my question. – STLDev Aug 11 '17 at 01:46
  • Yeah, I'm just giving some advice; sorry about that. As for the answer, I know you have an idea of how to resolve it. As your question implies, nobody knows what the next error will be. – Vijunav Vastivch Aug 11 '17 at 01:50
  • No worries - hope I didn't sound brusque. Thanks for trying to help. – STLDev Aug 11 '17 at 01:52
  • It's fine.. thanks for appreciating too. – Vijunav Vastivch Aug 11 '17 at 01:54
  • What about using HtmlAgilityPack and then rebuilding the XML node by node? There is really no good replacement for rejecting invalid XML... – Alexei Levenkov Aug 11 '17 at 02:01
  • You are putting the cart before the horse. None of these situations should ever occur. The real issue is the programs that generated the XML. Any program used to generate XML should be validated before it is used, so that none of these issues occur. The only type of XML error that is allowable is multiple root tags. XML is often used in log files, where the result is not well-formed because log data is appended and you cannot guarantee a single root tag. – jdweng Aug 11 '17 at 02:35

1 Answer

Rather than trying to parse non-XML with an XML parser and catching the errors, if you really want to process non-XML then I would try preprocessing it with a parser for the particular non-XML grammar that you want to accept. Before you ever submit the data to an XML parser, run it through a Perl script or similar that recognizes the patterns that you want to convert to XML, then run the result through an XML parser.
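
A minimal sketch of such a preprocessing step, written in Python rather than Perl, might look like the following. It covers only the byte-level cleanup from the question (characters that are invalid for the encoding) and then hands the result to a real XML parser as the final check. The function name `decode_leniently`, the default of UTF-8, and the choice to replace with a space are assumptions taken from the rules stated in the question, not an established recipe.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production, e.g. most control characters.
ILLEGAL_XML_CHARS = re.compile(
    r"[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_leniently(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode partner data, turning undecodable or XML-illegal characters into spaces."""
    # errors="replace" substitutes U+FFFD for bytes that are invalid in the encoding.
    text = raw.decode(encoding, errors="replace")
    # Follow the question's rule: an invalid character becomes a space.
    text = text.replace("\ufffd", " ")
    return ILLEGAL_XML_CHARS.sub(" ", text)


# 0xE9 is not valid UTF-8 on its own, and 0x01 is not a legal XML character.
raw = b"<item><name>Caf\xe9 \x01menu</name></item>"
cleaned = decode_leniently(raw)

# Only now is the data submitted to an XML parser; a failure here means the
# input contains a pattern the preprocessor does not yet recognize.
root = ET.fromstring(cleaned)
print(repr(root.find("name").text))
```

The point of the design is that the repair logic lives in its own step with its own rules, so the XML parser's exceptions never need to be classified: anything it still rejects is simply logged as an unhandled pattern.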

Michael Kay