The only way you can really find out if something will actually parse is to...try and parse it.
An XMl document should (but may not) have an XML declaration at the head of the file, following the BOM (if present). It should look something like this:
<?xml version="1.0" encoding="UTF-8" ?>
Though the encoding attribute is, I believe, optional (defaulting to UTF-8. It might also have a standalone
attribute whose value is yes
or no
. If that is present, that's a pretty good indicator that the document is supposed to be valid XML.
Riffing on @GaryWalker's excellent answer, something like this is about as good as it gets, I think (though the settings might need some tweaking, a custom no-op resolver perhaps). Just for kicks, I generated a 300mb random XML file using XMark xmlgen
(http://www.xml-benchmark.org/): validating it with the code below takes 1.7–1.8 seconds elapsed time on my desktop machine.
public static bool IsMinimallyValidXml( Stream stream )
{
XmlReaderSettings settings = new XmlReaderSettings
{
CheckCharacters = true ,
ConformanceLevel = ConformanceLevel.Document ,
DtdProcessing = DtdProcessing.Ignore ,
IgnoreComments = true ,
IgnoreProcessingInstructions = true ,
IgnoreWhitespace = true ,
ValidationFlags = XmlSchemaValidationFlags.None ,
ValidationType = ValidationType.None ,
} ;
bool isValid ;
using ( XmlReader xmlReader = XmlReader.Create( stream , settings ) )
{
try
{
while ( xmlReader.Read() )
{
; // This space intentionally left blank
}
isValid = true ;
}
catch (XmlException)
{
isValid = false ;
}
}
return isValid ;
}
static void Main( string[] args )
{
string text = "<foo>This &SomeEntity; is about as simple as it gets.</foo>" ;
Stream stream = new MemoryStream( Encoding.UTF8.GetBytes(text) ) ;
bool isValid = IsMinimallyValidXml( stream ) ;
return ;
}