
I have arbitrary XHTML documents that are often not well-formed, since websites can be written that way and browsers will still render them. How can I run an XSLT transformation on XHTML that is not well-formed? Is there a way to make it skip the parts that are not well-formed?

I have this Java code, but as I said, it doesn't handle XHTML that is not well-formed:

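import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

try {
    TransformerFactory tFactory = TransformerFactory.newInstance();

    // Stylesheet and input document; the input is parsed as strict XML
    Source xslDoc = new StreamSource("path1");
    Source xmlDoc = new StreamSource("path2");

    String outputFileName = "path3";

    OutputStream htmlFile = new FileOutputStream(outputFileName);
    Transformer transform = tFactory.newTransformer(xslDoc);
    transform.transform(xmlDoc, new StreamResult(htmlFile));
}
catch (Exception e) {...}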
Tommz
  • You can try to fix your not-well-formed XHTML using [JTidy](http://jtidy.sourceforge.net/). – helderdarocha Feb 21 '14 at 17:17
  • Take a look at this http://stackoverflow.com/questions/2547000/proper-usage-of-jtidy-to-purify-html?rq=1 – helderdarocha Feb 21 '14 at 17:20
  • Isn't there a way to make Transformer handle XHTML that is not well-formed? This isn't about my own XHTML, which I could make well-formed; I'm parsing other sites, so I can't expect their XHTML to always be well-formed. Also, I don't know whether JTidy would tidy markup the same way browsers do, and it probably wouldn't be good for performance. – Tommz Feb 21 '14 at 17:31
  • The native Java XML parsers require the XML to be well-formed, and XSLT processors assume the source is well-formed XML. If it's not well-formed, you can use an HTML parser instead. – helderdarocha Feb 21 '14 at 17:52
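
To illustrate the point made in the last comment, here is a minimal sketch using only the JDK. The class name `WellFormednessCheck` and the method `isTransformable` are made up for this illustration. It runs an identity transform over a string and reports whether the built-in XML parser accepted it. The default error listener may also print a parse error to stderr for the rejected input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the original question's code.
final class WellFormednessCheck {

    /** Returns true if the markup survives an identity transform, i.e. the XML parser accepted it. */
    static boolean isTransformable(String markup) {
        try {
            // newTransformer() with no stylesheet performs an identity transform
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(markup)),
                               new StreamResult(new StringWriter()));
            return true;
        } catch (TransformerException e) {
            // Raised when the source is not well-formed XML
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isTransformable("<p>closed</p>"));
        // <br> is never closed, which browsers tolerate but XML parsers do not
        System.out.println(isTransformable("<p>unclosed<br></p>"));
    }
}
```

There is no switch on `Transformer` shown here that makes it skip the broken parts; the input has to be repaired, or parsed by something more lenient, before it reaches the transformer.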

2 Answers


You can use the jsoup library to parse and repair the HTML, and then run the XSLT transformation on the repaired result.
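
One way this could look, as a sketch rather than a tested recipe: the transformer accepts a `DOMSource`, so a DOM tree built by a lenient parser can be passed in instead of a file. The code below uses only the JDK so that it runs on its own; the class `DomTransform`, the helper `sampleDocument`, and the stylesheet are invented for the example. With jsoup on the classpath, the stand-in document would be replaced by something like `new W3CDom().fromJsoup(Jsoup.parse(html))`, which is an assumption about your setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class for illustration only.
final class DomTransform {

    // Example stylesheet: prints the text of the first <p> element
    static final String FIRST_PARAGRAPH_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='//p'/></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to an already-built DOM tree and returns the output as a string. */
    static String transform(Document repaired, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(repaired), new StreamResult(out));
        return out.toString();
    }

    /** Stand-in for the tree a lenient HTML parser would produce (assumption: jsoup would supply it). */
    static Document sampleDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();
        Element p = doc.createElementNS(null, "p");
        p.setTextContent("hello");
        doc.appendChild(p);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(sampleDocument(), FIRST_PARAGRAPH_XSLT));
    }
}
```

Depending on the parser and its version, the repaired elements may end up in the XHTML namespace, in which case the match patterns in the stylesheet would need a namespace prefix.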

Jakub H
  • I tried, but it's still not working. I used Cleaner.clean() and Jsoup.clean(), but neither will parse through unclosed elements. – Tommz Feb 21 '14 at 18:34

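You can try an HTML parser such as the Validator.nu HTML Parser (http://about.validator.nu/htmlparser/) or TagSoup. Both parse real-world HTML leniently and can feed the result to an XSLT processor.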
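
A rough sketch of how such a parser could be plugged in: the transformer accepts a `SAXSource`, which pairs any SAX `XMLReader` with the input. The class `SaxTransform`, the method `strictReader`, and the stylesheet are made up for this example, and the code uses the JDK's own strict reader so that it runs without extra dependencies. With TagSoup on the classpath, the assumption is that `new org.ccil.cowan.tagsoup.Parser()` would be passed as the reader instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class for illustration only.
final class SaxTransform {

    // Example stylesheet: prints the text of the first <p> element
    static final String FIRST_PARAGRAPH_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='//p'/></xsl:template>"
      + "</xsl:stylesheet>";

    /** Parses the markup with the given reader and applies the stylesheet to the SAX events. */
    static String transform(XMLReader reader, String markup, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new SAXSource(reader, new InputSource(new StringReader(markup))),
                    new StreamResult(out));
        return out.toString();
    }

    /** The JDK's strict XML reader; a lenient HTML parser would take its place (assumption). */
    static XMLReader strictReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><p>hello</p></body></html>";
        System.out.println(transform(strictReader(), markup, FIRST_PARAGRAPH_XSLT));
    }
}
```

Streaming SAX events into the transformer avoids serializing the repaired markup to a string and parsing it a second time. A lenient parser may report elements in the XHTML namespace, so the stylesheet's match patterns might need a matching prefix.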

Martin Honnen