I am using JTidy (the Java port of the HTML Tidy library) to scrub some existing sites. With my configuration, JTidy seems to be very strict and ends up cutting off the bottom of the page when the markup is bad.
When I run the same markup through the W3C HTML validator tool, it cleans it up but is more intelligent in its rewriting: instead of chopping off tags, it seems to guess where the missing tag was and updates the structure accordingly.
Does anyone know the HTML Tidy configuration the W3C uses?
My JTidy configuration is as follows:
Tidy tidy = new Tidy();
tidy.setTidyMark(false);
tidy.setXHTML(true);
tidy.setXmlOut(false);
tidy.setNumEntities(true);
tidy.setSpaces(2);
tidy.setWraplen(2000);
tidy.setUpperCaseTags(false);
tidy.setUpperCaseAttrs(false);
tidy.setQuiet(false);
tidy.setMakeClean(true);
tidy.setShowWarnings(true);
tidy.setBreakBeforeBR(true);
tidy.setHideComments(true);