
I'm evaluating several HTML parsers and gave JTidy a try. Parsing this URL:

http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html

Gives these errors:

line 1 column 56,258 - Error: missing '>' for end of tag

line 1 column 56,258 - Error: is not recognized!

It reports line 1 because the page is read in as a single line, but this is the line that JTidy fails on:

      <li>//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a</li>
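For anyone trying to reproduce this: since the whole page arrives as one line, the reported column is the only pointer back to the markup. Below is a minimal sketch of how that column could be mapped to the surrounding text, using only the standard library. `contextAt` is a hypothetical helper (not part of JTidy), the sample string is a shortened stand-in for the real page, and treating the column as 1-based is an assumption.

```java
public class ColumnContext {

    // Hypothetical helper: returns the text surrounding a column in a
    // document that was read in as a single line.
    // Assumption: the reported column is 1-based.
    static String contextAt(String html, int column, int radius) {
        // Convert the column to a 0-based index and clamp it to the string bounds.
        int index = Math.max(0, Math.min(html.length(), column - 1));
        int start = Math.max(0, index - radius);
        int end = Math.min(html.length(), index + radius);
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real page source.
        String html = "<ul><li>//div[last() >= 4]//li[2]//a</li></ul>";
        // Column 18 is chosen here purely for illustration.
        System.out.println(contextAt(html, 18, 10));
    }
}
```

With the real page source in `html`, calling `contextAt(html, 56258, 80)` should show roughly which markup the parser was looking at when it reported the error.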

My code is pretty simple:

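import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// getHtml() returns the page source as a String; tidy uses the default configuration
Tidy tidy = new Tidy();
Document document = tidy.parseDOM(new ByteArrayInputStream(this.getHtml().getBytes()), null);
NodeList anchorTags = document.getElementsByTagName("A");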

Is this just a bug in JTidy or am I doing something wrong? I've evaluated about 6 others so far and none of them have had a problem on this page.

ollo
Jerry Skidmore
  • No idea as to the problem with `JTidy`. I use [JSoup](http://jsoup.org/) for my HTML parsing. It's excellent. – Steven Apr 26 '13 at 00:54
  • JSoup is one of the packages I have done some sample evaluations with and really like it, I just want to go through most to give a detailed proposal to my tech review group. I'd at least like to say I had vetted JTidy and it has too many bugs to be used out of the box for even our simple needs at this point. – Jerry Skidmore Apr 26 '13 at 01:00
  • I'd go with the "JTidy has too many bugs to be used out of the box" evaluation. :) – Steven Apr 26 '13 at 01:06

0 Answers