2

I have many long documents that need to be parsed. The format looks like XML but is not actually XML.

Here's an example:

<DOC>
    <TEXT>it's the content P&G</TEXT>
</DOC> 
<DOC>
    <TEXT>it's antoher</TEXT>
</DOC>

Note that there are multiple root tags (<DOC>), and the & should be written as &amp; in XML.

Thus, the above file is not standard XML.

Can I use XmlDocument to parse the file, or should I write my own parser?

Michael Kay
daisydan
  • 1
    Would replacing & with &amp; and wrapping the whole string with ... be enough? Or is there more? – Adrian Wragg Jul 19 '13 at 09:37
  • Since it's not XML you'll not be able to use an XML parser. You'll want to decide what it really is, and then use a parser for that thing. – David Heffernan Jul 19 '13 at 09:39
  • I'm going to be brutal and remove the "XML" tag, since this is a question about how to parse some language that isn't XML. – Michael Kay Jul 19 '13 at 13:28

3 Answers

6

Calling this "not standard XML" is somewhat incorrect. The document is not XML. Period.

You cannot use XmlDocument or any other XML parser to parse it as a complete document.

You need to ensure that you have valid XML before you try to parse it using an XML parser.

So, in this case, either wrap the document in a root element or split it into several documents. In either case, you need to ensure that the special characters are encoded correctly (quotes, ampersands, etc.).
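The first option (wrap in a root element and escape the ampersands) could look roughly like the sketch below. The class name `PseudoXml`, the root name `DOCS`, and the regular expression are illustrative assumptions, not part of the original question. It only repairs bare `&` characters, so a literal `<` inside the text would still make the load fail.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: the name and the root element "DOCS" are assumptions.
static class PseudoXml
{
    // Matches an '&' that does not already start a predefined entity
    // or a numeric character reference.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|lt|gt|apos|quot|#\d+|#x[0-9A-Fa-f]+);)");

    // Escapes bare ampersands, wraps the input in a single root element,
    // and loads the result into an XmlDocument.
    public static XmlDocument Load(string raw)
    {
        string escaped = BareAmpersand.Replace(raw, "&amp;");
        var doc = new XmlDocument();
        doc.LoadXml("<DOCS>" + escaped + "</DOCS>");
        return doc;
    }
}
```

With the sample from the question, `PseudoXml.Load(text).SelectNodes("/DOCS/DOC/TEXT")` should then return one node per `<TEXT>` element, and `InnerText` gives back the unescaped content.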

The answer by oakio gets you part way by treating the document as an XML fragment, but this still doesn't help with invalid content such as unescaped ampersands.

Community
Oded
2

As @Oded says, this isn't an XML document - just some text.

However with some pre-parsing you might be able to convert it:

Wrap the whole thing in a new root node:

<DOCS>
    <DOC>
        <TEXT>it's the content P&G</TEXT>
    </DOC> 
    <DOC>
        <TEXT>it's antoher</TEXT>
    </DOC>
</DOCS>

Then search for the disallowed characters and replace them with their entities (e.g. &apos; and &amp;).

As pointed out in the comments, you should replace & first to avoid double encoding (i.e. ending up with &amp;apos;).

You might have to do this via string manipulation though, depending on where you're getting the data from.
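The effect of the replacement order can be seen in a small sketch. The class and method names are hypothetical, and the methods are meant to be applied to text content only, not to a string that still contains the tags.

```csharp
using System;

// Hypothetical helper illustrating why '&' has to be replaced first.
static class EntityOrder
{
    // Correct order: '&' first, so the ampersands introduced by the later
    // replacement are not escaped a second time.
    public static string EscapeAmpersandFirst(string text) =>
        text.Replace("&", "&amp;").Replace("'", "&apos;");

    // Wrong order, shown only for contrast: the '&' in "&apos;" is escaped again.
    public static string EscapeAmpersandLast(string text) =>
        text.Replace("'", "&apos;").Replace("&", "&amp;");
}
```

For the input `it's P&G`, the first method yields `it&apos;s P&amp;G`, while the second yields the double-encoded `it&amp;apos;s P&amp;G`.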

Jon Egerton
  • 1
    The only issue with the string replace is double encoding (in particular when replacing `&` with `&amp;`, which is why it should be the _first_ replacement). – Oded Jul 19 '13 at 09:49
1

Yes, but you should set XmlReaderSettings.ConformanceLevel:

XmlReaderSettings settings = new XmlReaderSettings()
{
    ConformanceLevel = ConformanceLevel.Fragment
};
using (XmlReader reader = XmlReader.Create(stream, settings))
{
    //TODO: read here
}

More: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.conformancelevel.aspx
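As a slightly fuller sketch of the snippet above, the reader can be used to collect the content of each `<TEXT>` element from a multi-root fragment. The class and method names are made up for illustration, and the input is assumed to have its ampersands escaped already, since a bare `&` still raises an `XmlException` even at fragment conformance level.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: reads every <TEXT> element from a fragment
// that may contain several top-level <DOC> elements.
static class FragmentReader
{
    public static List<string> ReadTexts(string fragment)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };
        var texts = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(fragment), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "TEXT")
                {
                    // Reads the text and moves past the closing </TEXT>,
                    // so no extra Read() is needed in this branch.
                    texts.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return texts;
    }
}
```

This handles the multiple root elements without wrapping the input, but the escaping step still has to happen before the text reaches the reader.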

oakio