jtidy fails to parse html - options

Question

I am evaluating several HTML parsers and gave JTidy a try. Parsing this URL:

http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html

Gives these errors:

line 1 column 56,258 - Error: missing '>' for end of tag

line 1 column 56,258 - Error: is not recognized!

JTidy reports line 1 because it reads the whole document in as a single line, but this is the markup it fails on:

      <li>//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a</li>
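Since the whole page is reported as line 1, one way to confirm which markup sits at a reported position is to print the text around that column. The helper below is only a minimal sketch: the class and method names (`ErrorContext`, `around`) are made up for illustration, and it assumes the reported column is a 1-based character offset into the single-line document, which may not match how JTidy actually counts columns (tabs or multi-byte characters could shift it).

```java
// Hypothetical helper, not part of JTidy: shows the text around a column offset.
final class ErrorContext {
    private ErrorContext() {}

    /**
     * Returns the characters surrounding a 1-based column in a single-line document.
     * "radius" is the number of characters kept on each side of that column.
     * Assumption: the column is a plain character offset.
     */
    static String around(String html, int column, int radius) {
        if (html.isEmpty()) {
            return "";
        }
        // Convert the 1-based column to a 0-based index and clamp it to the string.
        int index = Math.max(0, Math.min(html.length() - 1, column - 1));
        int start = Math.max(0, index - radius);
        int end = Math.min(html.length(), index + radius + 1);
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Sample input modeled on the fragment quoted in the question.
        String html = "<ul><li>[position() > 22]//li[2]//a</li></ul>";
        System.out.println(around(html, 21, 10));
    }
}
```

Calling `ErrorContext.around(html, 56258, 80)` on the downloaded page would show roughly 80 characters on either side of the reported column, which makes it easier to see whether the parser stopped at the same fragment.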

My code is pretty simple:

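import java.io.ByteArrayInputStream;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// Inside a method of the class that provides getHtml():
Tidy tidy = new Tidy(); // default configuration
Document document = tidy.parseDOM(new ByteArrayInputStream(this.getHtml().getBytes()), null);
NodeList anchorTags = document.getElementsByTagName("A");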

Is this just a bug in JTidy or am I doing something wrong? I've evaluated about 6 others so far and none of them have had a problem on this page.

No idea as to the problem with `JTidy`. I use [JSoup](http://jsoup.org/) for my HTML parsing. It's excellent. — Steven, Apr 26 '13 at 00:54
JSoup is one of the packages I have done some sample evaluations with and really like it, I just want to go through most to give a detailed proposal to my tech review group. I'd at least like to say I had vetted JTidy and it has too many bugs to be used out of the box for even our simple needs at this point. — Jerry Skidmore, Apr 26 '13 at 01:00
I'd go with the "JTidy has too many bugs to be used out of the box" evaluation. :) — Steven, Apr 26 '13 at 01:06

0 Answers