How to support not-well-formed XHTML in XSLT

Question

I have arbitrary XHTML documents that are usually not well formed, since websites can be written that way and browsers will still display them. How can I support XSLT transformation of XHTML that is not well formed? Is there a way to make the transformation skip the parts that are not well formed?

I have this code in Java, but as I said, it does not support XHTML that is not well formed:

    try {
        TransformerFactory tFactory = TransformerFactory.newInstance();

        Source xslDoc = new StreamSource("path1");
        Source xmlDoc = new StreamSource("path2");

        String outputFileName = "path3";

        OutputStream htmlFile = new FileOutputStream(outputFileName);
        Transformer transformer = tFactory.newTransformer(xslDoc);
        transformer.transform(xmlDoc, new StreamResult(htmlFile));
    } catch (Exception e) {...}

You can try to fix your not-well-formed XHTML using [JTidy](http://jtidy.sourceforge.net/). – helderdarocha, Feb 21 '14 at 17:17
Take a look at this: http://stackoverflow.com/questions/2547000/proper-usage-of-jtidy-to-purify-html?rq=1 – helderdarocha, Feb 21 '14 at 17:20
Isn't there a way to support transformation of not-well-formed XHTML with Transformer? It's not about "my" XHTML: I could make my own XHTML well formed, but since I'm parsing websites, I can't expect their XHTML to always be well formed. Also, I don't know whether JTidy would do the same tidying that browsers do, and it probably wouldn't be good for performance. – Tommz, Feb 21 '14 at 17:31
The native Java XML parsers require the XML to be well formed, and XSLT processors assume the source is well-formed XML. If it's not well formed, you can use an HTML parser. – helderdarocha, Feb 21 '14 at 17:52
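
The behaviour described in the last comment can be reproduced with the JDK alone. This is a minimal sketch, not the asker's code: the class name `WellFormedCheck`, the method `canTransform`, and the sample strings are made up for illustration. It runs the identity transform (a `Transformer` created without a stylesheet) and reports whether the input could be read as XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, for illustration only.
public class WellFormedCheck {

    // Returns true if the JDK's Transformer can read the markup as XML.
    public static boolean canTransform(String markup) {
        try {
            // newTransformer() without a stylesheet yields the identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new StreamSource(new StringReader(markup)),
                               new StreamResult(new StringWriter()));
            return true;
        } catch (TransformerException e) {
            // The underlying XML parser rejected the input.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canTransform("<p>closed</p>"));
        System.out.println(canTransform("<p>never closed<br>"));
    }
}
```

The second call fails because of the unclosed `<p>` and `<br>` elements, which is the kind of markup browsers accept but an XML parser does not. The `Transformer` has no switch for skipping such parts, so the repair has to happen before the input reaches it.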

score 2 · Accepted Answer · answered Feb 21 '14 at 17:23

You can use the jsoup library to parse and repair your HTML, and then apply the XSLT to the result.

Jakub H

I tried, but it's still not working. I used Cleaner.clean() and Jsoup.clean(), but neither will parse through unclosed elements. – Tommz Feb 21 '14 at 18:34
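
A possible reason the cleaner calls did not help: as far as I know, `Jsoup.clean()` and `Cleaner` are sanitizers that filter markup against an allow-list, and jsoup serializes in HTML syntax by default, so elements such as `<br>` can still come out unclosed. One alternative is to skip re-serialization and pass the parsed tree to the `Transformer` as a DOM through `DOMSource`. The sketch below shows only that XSLT side with JDK classes. The class `DomSourceSketch` and its methods are hypothetical, and the tree is built by hand as a stand-in for what a lenient parser would produce. With jsoup on the classpath, its `org.jsoup.helper.W3CDom` class is intended for converting a parsed page into such an `org.w3c.dom.Document`; check the exact call against the jsoup version in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
public class DomSourceSketch {

    // Stand-in for a lenient HTML parser: builds html/body/p by hand.
    public static Document buildTree(String text) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element body = dom.createElement("body");
            Element p = dom.createElement("p");
            p.setTextContent(text);
            body.appendChild(p);
            html.appendChild(body);
            dom.appendChild(html);
            return dom;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Feeds the in-memory tree to a Transformer; no text is parsed here.
    public static String transform(Document dom) {
        try {
            // Identity transform; pass a stylesheet Source to newTransformer() for real use.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the `Transformer` receives a tree rather than text, well-formedness of the original page no longer matters at this stage; it only matters that the parser that built the tree tolerated the input.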

score 1 · Answer 2 · answered Feb 21 '14 at 17:21

You can try an HTML parser such as http://about.validator.nu/htmlparser/ or TagSoup.

Martin Honnen

I'm trying TagSoup, but I can't make it work. Do you have an example? – Tommz Feb 21 '14 at 19:19
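
A sketch of how a SAX-based HTML parser could be wired into the transformation. The `Transformer` accepts a `SAXSource`, which pairs any `XMLReader` with an `InputSource`, so the parser can be swapped without changing the XSLT code. The class `SaxSourceSketch`, its methods, and the inline stylesheet are assumptions made for illustration. The runnable part uses the JDK's own strict reader as a placeholder; TagSoup's `org.ccil.cowan.tagsoup.Parser` implements `XMLReader` and could be passed in its place if TagSoup is on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, for illustration only.
public class SaxSourceSketch {

    // Illustrative stylesheet: prints the text of the first <p>, in any namespace.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select=\"//*[local-name()='p']\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over markup that is read by the supplied XMLReader.
    public static String transform(XMLReader reader, String markup) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new SAXSource(reader, new InputSource(new StringReader(markup))),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Placeholder reader from the JDK; it is strict and rejects malformed input.
    public static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With TagSoup available, the call would look like `SaxSourceSketch.transform(new org.ccil.cowan.tagsoup.Parser(), html)`. The stylesheet matches on `local-name()` because, as far as I know, TagSoup reports elements in the XHTML namespace by default, so a plain `//p` would select nothing.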
