A piece of HTML that I'm trying to parse contains some attribute values without quotation marks, for example the width and height attributes:

<img src="/static/logo.png" width=75 height=90 />

In the C# code, the reader reads until the next anchor tag.

while (reader.ReadToFollowing("a"))

This statement throws an XmlException:

'75' is an unexpected token. The expected token is '"' or '''. Line 16, position 37.

Is there an XmlReaderSettings option that makes XmlReader more lenient? I do not have control over the generated HTML.

byneri
  • You should not use XmlReader to parse HTML; see "Is there an XmlReader equivalent for HTML in .Net?" (http://stackoverflow.com/questions/6452433/is-there-an-xmlreader-equivalent-for-html-in-net). – Polyfun Aug 13 '12 at 14:14

2 Answers

In order to read HTML, you'll need a reader designed for that purpose. The HtmlAgilityPack can help you here, as can the SgmlReader referred to in this answer to a related question.

HTML is not XML. Both have their roots in SGML, but they follow different rules. XML is much stricter than HTML: every element must be closed, and every attribute value must be enclosed in single or double quotes. Therefore, unless you are parsing XML-compliant XHTML, XmlReader will not work for you.
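
As a rough illustration of the HtmlAgilityPack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the HTML string is a made-up sample modelled on the snippet in the question, not the asker's real page.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

class Program
{
    static void Main()
    {
        // Hypothetical input: note the unquoted width and height values.
        string html =
            "<html><body>" +
            "<img src=\"/static/logo.png\" width=75 height=90 />" +
            "<a href=\"/about\">About</a>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant HTML parser, unlike XmlReader

        // Unquoted attribute values are read like any other attribute.
        var img = doc.DocumentNode.SelectSingleNode("//img");
        if (img != null)
            Console.WriteLine(img.GetAttributeValue("width", ""));

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (var link in links)
                Console.WriteLine(link.GetAttributeValue("href", "") + " " + link.InnerText);
        }
    }
}
```

The XPath query over the anchor tags plays the role of the original `reader.ReadToFollowing("a")` loop.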

carla
Jeff Yates
  • How can HtmlAgilityPack help here? I already tried `htmlDoc.OptionFixNestedTags = true; htmlDoc.OptionCheckSyntax = true; htmlDoc.OptionAutoCloseOnEnd = true; htmlDoc.OptionOutputOptimizeAttributeValues = true;` and nothing does the trick... – Ninita Mar 21 '16 at 16:39

You can use the WebBrowser control as well. Load the file into it and get an HtmlDocument from the WebBrowser.Document property. You can then loop through the elements.
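
A minimal sketch of that idea, assuming a Windows Forms project (a reference to System.Windows.Forms) and an STA thread with a message loop. The HTML string is a hypothetical stand-in for the real file, and setting DocumentText is used here in place of loading a file from disk.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        // The document is only available once loading has finished.
        browser.DocumentCompleted += (sender, e) =>
        {
            HtmlDocument doc = browser.Document;
            foreach (HtmlElement link in doc.GetElementsByTagName("a"))
            {
                // The browser may resolve href against the document's base URL.
                Console.WriteLine(link.GetAttribute("href") + " " + link.InnerText);
            }
            Application.ExitThread(); // stop the message loop started below
        };

        // Hypothetical input with unquoted attribute values.
        browser.DocumentText =
            "<html><body>" +
            "<img src=\"/static/logo.png\" width=75 height=90 />" +
            "<a href=\"/about\">About</a>" +
            "</body></html>";

        Application.Run(); // pump messages so the control can load the document
    }
}
```

Because the control wraps the installed Internet Explorer engine, it needs no extra DLL, but it only works on Windows and is considerably heavier than a dedicated parser.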

Belmiris
  • That's great. Originally I was using HtmlAgilityPack, but for this simple office utility, I did not want to add the DLL dependency (a single EXE is what I want), so this WebBrowser approach will work fine. – byneri Aug 13 '12 at 14:28
  • This is an interesting hack. It feels a little dirty but I can already think of situations where I might use it. – Jeff Yates Aug 13 '12 at 14:31