I am attempting to use C# `XElement` to parse HTML. In HTML, `src` attributes contain URLs and query strings with `?` and `/`. Is it possible to make them parsable?
-
That will work just fine. However, HTML is usually not valid XML; consider using HTML Agility Pack. – SLaks Aug 29 '13 at 18:23
-
It does not. It will throw `'<' is an unexpected token. The expected token is ';'. Line 709, position 43.` HTML Agility Pack is full of bugs. I don't have much faith in it... – Hoy Cheung Aug 29 '13 at 18:24
-
@user1978421: `<` isn't `?` or `/` is it? – Jon Skeet Aug 29 '13 at 18:25
-
Sorry, I see why now. It's because there is an `&` before it. – Hoy Cheung Aug 29 '13 at 18:26
1 Answer
LINQ to XML is only designed to parse XML, not HTML. In fact, `?` and `/` shouldn't cause a problem for LINQ to XML, although `&` in unexpected places will, along with unclosed or unbalanced tags.
You should use something like HTML Tidy or HTML Agility Pack to parse HTML, unless you know that the HTML you want to parse is actually valid XML.
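
To illustrate the points above, here is a minimal sketch, assuming a .NET project that supports top-level statements (C# 9 or later). The helper name `ParsesAsXml` and the sample URLs are made up for the demonstration. It only shows which inputs LINQ to XML accepts; it is not a way to parse real-world HTML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: returns true when LINQ to XML accepts the markup
// as well-formed XML, false when XElement.Parse throws an XmlException.
static bool ParsesAsXml(string markup)
{
    try
    {
        XElement.Parse(markup);
        return true;
    }
    catch (XmlException)
    {
        return false;
    }
}

// '?' and '/' inside an attribute value are ordinary characters in XML.
Console.WriteLine(ParsesAsXml("<img src=\"http://example.com/a/b?x=1\" />"));

// A bare '&' starts an entity reference, so the parser expects a ';' later.
Console.WriteLine(ParsesAsXml("<img src=\"http://example.com/a?x=1&y=2\" />"));

// Escaping the ampersand as &amp; makes the same URL acceptable.
Console.WriteLine(ParsesAsXml("<img src=\"http://example.com/a?x=1&amp;y=2\" />"));

// An unclosed tag, which is common in HTML, is not well-formed XML.
Console.WriteLine(ParsesAsXml("<p><img src=\"a.png\"></p>"));
```

The first and third calls should report success and the second and fourth should report failure, which matches the error quoted in the comments: the unescaped `&` is the problem, not the `?` or `/`.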

Jon Skeet
-
Thank you for the suggestion. Does HTML Tidy perform better than HTML Agility Pack? HTML Agility Pack is so buggy; I have tried it before. – Hoy Cheung Aug 29 '13 at 18:29
-
@user1978421: I don't know - I haven't used either myself, but have heard good things about both. Perhaps the bugs that you ran into have been fixed? – Jon Skeet Aug 29 '13 at 18:32