I'm evaluating a few HTML parsers and gave JTidy a try. Parsing this URL:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html
Gives these errors:
line 1 column 56,258 - Error: missing '>' for end of tag
line 1 column 56,258 - Error: is not recognized!
JTidy reports line 1 because the page is read in as a single line, but this is the markup it fails on:
<li>//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a</li>
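To map a reported column back to the markup when everything is on one line, a small helper along these lines can print the text around that column. This is only a sketch: `ColumnContext` and `contextAt` are hypothetical names (not part of JTidy), and it assumes the reported column is a 1-based character offset into the input string, which may not hold exactly for JTidy's own counting.

```java
public class ColumnContext {
    // Hypothetical helper, not part of JTidy: returns the characters within
    // `radius` positions on either side of the given column.
    public static String contextAt(String line, int column, int radius) {
        int index = column - 1; // assumption: the reported column is 1-based
        if (index < 0 || index >= line.length()) {
            throw new IllegalArgumentException("column out of range: " + column);
        }
        int start = Math.max(0, index - radius);
        int end = Math.min(line.length(), index + radius + 1);
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page; the real input would be getHtml().
        String html = "<ul><li>//div[last() >= 4]//li[2]//a</li></ul>";
        int column = html.indexOf(">=") + 1; // 1-based column of the bare '>'
        System.out.println(contextAt(html, column, 10));
    }
}
```

With the real page, the call would be something like `contextAt(getHtml(), 56258, 40)`, which should show whether the column lands on the bare `>` characters inside the `<li>` text.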
My code is pretty simple:
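import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
Tidy tidy = new Tidy();
// Parse the page's HTML into a DOM (getBytes() uses the platform default charset)
Document document = tidy.parseDOM(new ByteArrayInputStream(this.getHtml().getBytes()), null);
// Collect the anchor elements
NodeList anchorTags = document.getElementsByTagName("A");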
Is this a bug in JTidy, or am I doing something wrong? I've evaluated about six other parsers so far, and none of them had a problem with this page.