I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain ELEMENTS in the documents, but everytime I go to load and try to read them into an XDocument, XMLDocument, or even just a StreamReader, I get different various XMLException errors.
Things like "'[' is an unexpected token.". Why? Because I have a document with DOCTYPE like
<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >
and I have learned that the "[]" needs to have something valid inside. Again, I don't control the creation of the documents, but I DO HAVE to "crack" them and get at the data I want. Another example is having an "unclosed" ELEMENT, for example:
<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>
This XMLException is "The 'hyphen' start tag on line 27 does not match the end tag of 'Caption'. Line 27, position 58." Obvious, right?
But then the question is how can you actually get at certain ELEMENTS in these documents, without encountering XMLExceptions. Is a SAX parser the right way? I basically want to open the document, go right to the element I want (without worrying what might or might not be well-formed nearby), pull the data, and move on. Should I just forget parsing with XMLDocument, XDocument, and just do simple string replacements like
str.Replace("<hardhypen><hyphen>", "-")
and then try to load it into one of the XML parsers. Any tips on strategies?