I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code:
StringReader inStr = new StringReader(htmlInput);
StringWriter outStr = new StringWriter();
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parseDOM(inStr, outStr);
I get the following output:
InputStream: Document content looks like HTML 4.01 Transitional
247 warnings, 3 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
Trouble is, Tidy doesn't tell me what 3 errors it found.
I'm fibbing here a little. The output above actually follows a long list of all 247 warnings (mostly trimming out empty div
elements). I can suppress those with tidy.setShowWarnings(false)
; either way, I see no error report, so I can't figure out what I need to fix. 300Kb of HTML is too much for me to eyeball.
I've tried numerous approaches to finding the error. I can't run it through validate.w3.org, sadly, as the HTML file is on a proprietary network. The most informative approach was to open it in IntelliJ IDEA; this revealed a dozen or so duplicate div IDs, which I fixed. Errors still occurred.
I've looked around for other mentions of this problem. While I find plenty of hits on things like "How can I get the error/warning messages out of the parsed HTML using JTidy?", they all appear to be asking for dissimilar things, or assume conditions that simply aren't holding for me. I'm getting warnings just fine, for example; it's the errors I need, and they're not being reported, even if I call setShowErrors(100)
or something.
Am I going to have to dive into Tidy's source code and debug it, starting where it reports errors? Or is there something much simpler I could do?