19

Does anyone know of a quick way to check if a string is parseable as XML in C#? Preferably something fast and light on resources that returns a boolean indicating whether or not it will parse.

I'm working on a database app which deals with errors that are sometimes stored as XML, and sometimes not. Hence, I'd like to just be able to test the string I grab from the database (contained in a DataTable) very quickly...and not have to resort to any try / catch {} statements or other kludges...unless those are the only way to make it happen.

user978122
  • Unfortunately, you can't know that some piece of text is valid XML until you've read it all. Every solution you get will be a variation of that. You could find the guys responsible for the malformed XML and give them a [XML Bozo Certification](http://www.codinghorror.com/blog/2006/07/are-you-an-xml-bozo.html), though. – Geeky Guy Sep 09 '13 at 18:42
  • If this is your bottleneck, then the best you can do is throw away completely the idea of storing stuff in XML format and use binary formats. Also, don't store bogus values, but organize it in a manner that allows you to immediately tell types without complex analysis. – Display Name Sep 09 '13 at 18:46
  • @SargeBorsch The storage mechanism is not up to me (dozen or so developers on my floor alone); when an application crashes, the errors are logged in the database, sometimes as XML (a stack trace), sometimes as simply a message. Currently I am using a Linq statement to parse the XML into a more readable format, as simply outputting it into a TextBox 'as is' is kind of messy. However, after I wrote this part, I came to realize (as mentioned earlier) that not all errors are in XML, and this causes Linq to scream. I guess I just wanted a quick way to ensure Linq could parse the XML. – user978122 Sep 10 '13 at 21:18

4 Answers

18

It sounds like you sometimes get back XML and sometimes get back "plain" (non-XML) text.

If that's the case you could just check that the text starts with <:

if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
{
    // Braces are required: C# does not allow a declaration as the sole body of an if.
    var doc = XDocument.Parse(str);
}

Since "plain" messages seem unlikely to start with <, this may be reasonable. The only thing you need to decide is what to do in the edge case where non-XML text happens to start with a <.

If it were me I would default to trying to parse it and catching the exception:

if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
{
    try
    {
        var doc = XDocument.Parse(str);
        return //???
    }
    catch (XmlException)
    {
        // Started with < but was not well-formed XML: fall back to the raw text.
        return str;
    }
}
else
{
    return str;
}

That way the only time you have the overhead of a thrown exception is when you have a message that starts with < but is not valid XML.
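What the placeholder return above should produce depends on the caller. If a boolean is all that is needed, the same idea can be sketched as a Try-style method. The names `XmlSniffer` and `TryParseXml` are made up for illustration, and the sketch assumes the string carries no leading byte-order mark; treat it as one possible shape rather than a definitive implementation:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the class and method names are illustrative only.
public static class XmlSniffer
{
    // Returns true and sets 'doc' when 'text' parses as XML; otherwise returns false.
    public static bool TryParseXml(string text, out XDocument doc)
    {
        doc = null;

        // Cheap pre-check: once leading whitespace is skipped, XML must begin with '<',
        // so most plain messages exit here without any exception being thrown.
        if (string.IsNullOrWhiteSpace(text) ||
            !text.TrimStart().StartsWith("<", StringComparison.Ordinal))
        {
            return false;
        }

        try
        {
            doc = XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Started with '<' but was not well-formed XML.
            return false;
        }
    }
}
```

A caller could then write `if (XmlSniffer.TryParseXml(message, out var doc)) { /* format doc */ } else { /* show message as-is */ }`, keeping the try/catch out of the calling code.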

Canavar
D Stanley
  • This answer certainly seems to be on-track, given that the question is looking for "How do I tell if this is XML, or something completely different" - not "How do I tell if this is perfect XML". If you want to go one step further, you could try to write a short, safety-focused regex that will only parse the first XML element. – Katana314 Sep 09 '13 at 18:51
  • 1
    Probably should do a `str.TrimStart()` first though. – EkoostikMartin Sep 09 '13 at 18:52
  • @EkoostikMartin added that and a null/empty check. – D Stanley Sep 09 '13 at 18:53
  • @DStanley - nice, this appears to be the best answer, do lightweight checks first, then do the parse. – EkoostikMartin Sep 09 '13 at 18:55
  • A simple, yet useful solution. – user978122 Sep 12 '13 at 20:42
  • Obviously this assumes everything that isn't XML doesn't begin with a "<", which might not be the case! :) – bytedev Jun 11 '14 at 11:01
  • @nashwan No, it states clearly that if it _isn't_ XML, then an exception will be thrown and the text will be returned. But it _only_ tries to parse the text if it begins with a `<` since if it _doesn't_, it _can't_ be XML. You _could_ analyze the text a little deeper (like look for a document element), but there's no 100% guaranteed way of determining if a document is valid XML without reading the whole document. – D Stanley Jun 11 '14 at 12:04
15

You could try to parse the string into an XDocument. If it fails to parse, then you know that it is not valid.

string xml = "";
XDocument document = XDocument.Parse(xml);

And if you don't want to have the ugly try/catch visible, you can throw it into an extension method on the string class...

public static bool IsValidXml(this string xml)
{
    try
    {
        XDocument.Parse(xml);
        return true;
    }
    catch
    {
        return false;
    }
}

Then your code simply looks like if (mystring.IsValidXml()) {

John Kraft
  • 3
    How is that different from what the OP called a kludge? – Geeky Guy Sep 09 '13 at 18:36
  • Sometimes, "kludges" are the simplest, easiest, best way. It really depends on the situation. – John Kraft Sep 09 '13 at 18:38
  • I think the OP was looking for something short of actually parsing but I still don't call this a kludge. +1 – paparazzo Sep 09 '13 at 18:39
  • @Renan The question you should ask is what exactly makes this a kludge? How else can you see if something will parse without actually parsing it? – Logarr Sep 09 '13 at 18:39
  • 2
    @Logarr There are much more lightweight means of parsing the data if you only care if it's valid XML. Your memory footprint goes from the whole file down to almost nothing, the processing time goes way down, etc. If you have really large files, that can make a difference. – Servy Sep 09 '13 at 18:40
  • I did not downvote this... I'm actually +1'ing it right now. You make a good point there. It just wasn't clear to me at first. – Geeky Guy Sep 09 '13 at 18:40
7

The only way you can really find out if something will actually parse is to...try and parse it.

An XML document should (but need not) have an XML declaration at the head of the file, following the BOM (if present). It should look something like this:

<?xml version="1.0" encoding="UTF-8" ?>

Though the encoding attribute is, I believe, optional (defaulting to UTF-8). It might also have a standalone attribute whose value is yes or no. If the declaration is present, that's a pretty good indicator that the document is supposed to be valid XML.
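As a rough illustration of using the declaration as a hint, here is a minimal sketch. The names `XmlDeclarationHint` and `HasXmlDeclaration` are hypothetical. A true result only suggests the text is meant to be XML, and a false result proves nothing, since the declaration is optional:

```csharp
using System;

// Hypothetical helper; the names are illustrative only.
public static class XmlDeclarationHint
{
    // True when the text begins with "<?xml" followed by whitespace,
    // optionally preceded by a byte-order-mark character.
    public static bool HasXmlDeclaration(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        // A decoded string may still carry the BOM as U+FEFF at position 0.
        string candidate = text.TrimStart('\uFEFF');

        // Require whitespace after "<?xml" so that other processing instructions
        // such as "<?xml-stylesheet ...?>" are not mistaken for the declaration.
        return candidate.Length > 5
            && candidate.StartsWith("<?xml", StringComparison.Ordinal)
            && char.IsWhiteSpace(candidate[5]);
    }
}
```

This does not replace parsing; it only helps decide whether a parse attempt is worth making.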

Riffing on @GaryWalker's excellent answer, something like this is about as good as it gets, I think (though the settings might need some tweaking, a custom no-op resolver perhaps). Just for kicks, I generated a 300 MB random XML file using XMark xmlgen (http://www.xml-benchmark.org/): validating it with the code below takes 1.7–1.8 seconds of elapsed time on my desktop machine.

public static bool IsMinimallyValidXml( Stream stream )
{
  XmlReaderSettings settings = new XmlReaderSettings
    {
      CheckCharacters              = true                          ,
      ConformanceLevel             = ConformanceLevel.Document     ,
      DtdProcessing                = DtdProcessing.Ignore          ,
      IgnoreComments               = true                          ,
      IgnoreProcessingInstructions = true                          ,
      IgnoreWhitespace             = true                          ,
      ValidationFlags              = XmlSchemaValidationFlags.None ,
      ValidationType               = ValidationType.None           ,
    } ;
  bool isValid ;

  using ( XmlReader xmlReader = XmlReader.Create( stream , settings ) )
  {
    try
    {
      while ( xmlReader.Read() )
      {
        ; // This space intentionally left blank
      }
      isValid = true ;
    }
    catch (XmlException)
    {
      isValid = false ;
    }
  }
  return isValid ;
}

static void Main( string[] args )
{
  string text = "<foo>This &SomeEntity; is about as simple as it gets.</foo>" ;
  Stream stream = new MemoryStream( Encoding.UTF8.GetBytes(text) ) ;
  bool isValid = IsMinimallyValidXml( stream ) ;
  return ;
}
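Since the question starts from strings pulled out of a DataTable, the same reader loop can be fed from a `StringReader`, which avoids re-encoding the text into a byte stream. This is only a sketch under that assumption; `XmlStringCheck` and `IsWellFormedXml` are hypothetical names, and the settings are a trimmed-down subset of the ones above:

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class XmlStringCheck
{
    // Returns true when 'text' can be read to the end as a well-formed XML document.
    public static bool IsWellFormedXml(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Document,
            DtdProcessing = DtdProcessing.Ignore,
        };

        try
        {
            using (var stringReader = new StringReader(text))
            using (var xmlReader = XmlReader.Create(stringReader, settings))
            {
                // Pull every node; nothing is kept, so memory use stays low.
                while (xmlReader.Read())
                {
                }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Unlike `XDocument.Parse`, this never builds a tree, so it suits a pure yes/no check; if the parsed document is needed afterwards, parsing once with `XDocument.Parse` is probably simpler.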
Nicholas Carey
  • Totally agree, the only way to actually check if something is XML is to parse the entire thing. Simply checking a char or two (like some of the other answers) does not guarantee a string is XML or not. +1 for setting up the settings ;) – bytedev Jun 11 '14 at 11:03
0

The best answer I know of for testing whether XML is well-formed is "What is the fastest way to programatically check the well-formedness of XML files in C#?". It covers using an XmlReader to do this efficiently.

Gary Walker
  • 1
    Down voters please indicate why. This certainly has a smaller memory requirement than XDocument.Parse(). – paparazzo Sep 09 '13 at 20:03
  • I believe the downvotes came before the link was added to the answer, and even then there's no substance to this answer. – Logarr Sep 09 '13 at 20:12
  • @Logarr Please explain. There is substance there. What substance do you feel is incorrect and why? – paparazzo Sep 09 '13 at 20:30
  • 1
    I initially typed in the link, but did not notice it was messed up so it came out empty, fixed it in about a minute. Don't care about my "score" so much, but it does explain the downvotes. I thought using XmlReader was the substance, esp. as the link also referred to avoid processing delays by custom handling within XMLReader if you are using namespaces. I suppose I could have explained this, but I thought that is why I referred to an already well-explained answer. – Gary Walker Sep 09 '13 at 20:56
  • 1
    @Blam The answer is little more than a link. While it's on SO, and therefore unlikely to become a dead link, it's still good practice to post a relevant excerpt from the linked page. (For the record, I have not voted on this answer. Just trying to rationalize the community's response.) – Logarr Sep 09 '13 at 21:07
  • @Logarr Then I understand. Should be more than a link. – paparazzo Sep 09 '13 at 22:36
  • Please post some relevant code or something else to this link only answer. – Serj Sagan Feb 05 '19 at 19:08