Force HTML Tidy to output XML (instead of XHTML), or force XSLTproc to parse XHTML files

Question

I have a large number of HTML files that I need to process with XSLT. An XML file selects which HTML files to use and what to do with each of them.

I tried:

Use HTML Tidy to convert HTML -> XHTML / XML
Use document(filename) in XSLT to read in particular XHTML/XML files
...use standard nodeset commands to access e.g. "html/body/*"

This doesn't work, because:

It seems that XSLT (tried: libXSLT/xsltproc ... and Saxon) cannot process XHTML documents as external files (it sees the xhtml DOCTYPE, and refuses to parse it as nodes).

Fine (I thought) ... XHTML is just XML, I just need to put it through HTML Tidy and say:

"output-xml yes ... output-html no ... output-xhtml no"

...but HTML Tidy ignores you if you attempt that, and forces html instead :(. It seems to be hardcoded to only output XML files if the input was XML to begin with.

Any ideas for how to:

Force HTML Tidy to obey the command-line parameters, and set the doctype I asked for
Force XSLTproc to parse xhtml DOCTYPEs as xml
...some other cunning way that will work?

NB: this has to work on OS X - it's part of a build process for iOS apps. That shouldn't be a big problem, but e.g. any windows-only tools aren't available. I'd like to achieve this with standard open-source cross-platform tools (like tidy, libxslt, etc)

score 2 · Accepted Answer · edited Jun 05 '12 at 23:43

I finally discovered why XSLTproc / Saxon were refusing to parse the files if they were passed-in with a DOCTYPE html:

The DOCTYPE of the external document alters how they interpret the xmlns (namespace) directive. Tidy was declaring (correctly) "xmlns=...the xhtml: namespace" - so all my node-names were ... I don't know: non-existent? ... inside my XSLT. XSLT was just ignoring them, as if they didn't exist - it needed me to provide a compatible mapping to the same namespace

...strangely, if the DOCTYPE was xml, then they happily ignored the xmlns command - or they allowed me to reference nodes by unqualified name. This fooled me into thinking that they were point-blank ignoring the nodesets inside the xhtml DOCTYPE'd version.

So, the "solution" is something like this:

modify your XSLT stylesheet to ALSO declare the "xhtml" namespace - NB: this is required so that you can reference the nodes in the external files
write all your XSL match / select / template rules with the "xhtml" prefix on every element name (attributes are not affected by the default namespace declaration, so they stay unprefixed)
let Tidy output whatever it wants: it doesn't matter, it'll Just Work, once you have the namespace support in there

Example code:

Your stylesheet goes from this:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

...to this:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">

Your select / match / document-import goes from this:

<xsl:copy-of select="document('html-files/file1.htm')/html/body"/>

...to this:

<xsl:copy-of select="document('html-files/file1.htm')/xhtml:html/xhtml:body"/>

NB: just to be clear: if you ignore namespaces, then it seems XSLT will work on files that are unDOCTYPED, even if they have a namespace in them. Don't make the mistake I made of thinking your XSLT is correct just because it appears to be :)
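The name-matching rule behind this can be reproduced outside XSLT. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` rather than xsltproc or Saxon; it assumes that ElementTree's path lookup treats unprefixed names the same way XPath 1.0 does (an unprefixed name only matches an element in no namespace). The two sample documents and the `find_body` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Roughly the shape of Tidy's XHTML output: a DOCTYPE plus a default namespace.
TIDY_STYLE_XHTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body><p>Testing</p></body>
</html>"""

# The same markup with no DOCTYPE and no xmlns declaration.
PLAIN_XML = """<html>
  <head><title>Sample</title></head>
  <body><p>Testing</p></body>
</html>"""


def find_body(xml_text, use_namespace):
    """Return the <body> child of the root, or None if the path matches nothing."""
    root = ET.fromstring(xml_text)
    if use_namespace:
        # Prefixed lookup: the prefix is mapped to the XHTML namespace URI.
        return root.find("xhtml:body", {"xhtml": XHTML_NS})
    # Unprefixed lookup: only matches a <body> that is in no namespace.
    return root.find("body")


if __name__ == "__main__":
    for label, text in (("namespaced", TIDY_STYLE_XHTML), ("plain", PLAIN_XML)):
        for use_ns in (False, True):
            found = find_body(text, use_ns) is not None
            print(f"{label:10s} prefixed={use_ns!s:5s} found={found}")
```

In this sketch the namespaced document is only matched by the prefixed path, and the plain document only by the unprefixed one, which is the same symptom as `html/body` silently selecting nothing in the stylesheet.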

score 0 · Answer 2 · answered Sep 04 '11 at 19:18

XHTML is XML (if it is valid).

To get your XHTML processed as XML, you must not serve it with the "text/html" MIME type. Use application/xhtml+xml instead (keep in mind that IE6 cannot render this and will prompt a download for your site).

In PHP, you serve it as xhtml+xml with the header() function.

I think this should do the trick:

header('Content-Type: application/xhtml+xml');

Does this help?

answered Sep 04 '11 at 19:18

breiti

If you were doing PHP (yourself) in the first place, you'd simply load the (X)HTML as `DOMDocument`, do the same for the XSLT stylesheet and use the PHP classes to take care of the transformation directly, rather than going through xsltproc. – user268396 Sep 04 '11 at 19:21
This is not a question about webservers, it's a question about doctypes. There isn't a webserver involved anywhere in the process. If you don't understand what a DOCTYPE is and how it affects XSLT processing, please go and look into that first. DOCTYPE is part of the document itself, whereas MIME type is not. – Adam Sep 04 '11 at 19:31
The problem is that an XHTML document may validate as an XML document, but according to XSLT, they are not the same thing. In particular, the document() function in XSLT is defined to work differently depending on which of the two doctypes it finds. I cannot find a way to force document() to treat everything as pure XML - if you know a way of doing that, I think that would fix the problem? – Adam Sep 04 '11 at 19:34

score 0 · Answer 3 · answered Sep 04 '11 at 19:24

If you run xsltproc --help, among the accepted input flags is a very conspicuous one called --html which supposedly tells xsltproc that:

--html: the input document is(are) an HTML file(s)

Presumably for this to work you must have valid HTML files to begin with, though. So you might want to tidy them up first.

answered Sep 04 '11 at 19:24

user268396

Thanks, I tried that, but that flag seems pretty useless - it causes ALL input files to be read as HTML, even though the main input file is XML (and so it causes XSLTproc to immediately crash, because every XML tag is invalid HTML :)) - when in fact we need to *selectively* say files are HTML, instead of XML. – Adam Sep 04 '11 at 20:02

Zach Young · Answer 4 · 2011-09-04T21:23:45.360

It's been a while, but I remember trying to use HTMLTidy to prep HTML files for XSLT and was disappointed by how easily it gave up while trying to "well form" the HTML. Then I found TagSoup, and was very pleased.

TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

I don't know if you're bound to HTMLTidy, but if not try this: http://home.ccil.org/~cowan/tagsoup/

As an example, here's a bad HTML file:

<body>
  <p>Testing
</body>

And here's the tagsoup command and its output:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html 
src: bad.html
<html><body>
  <p>Testing
</p></body></html>

Edit 01

Here is how tagsoup handles DOCTYPEs.

Here's a bad HTML file with a valid DOCTYPE:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
  <p>Testing
</body>
</html>

Here's how tagsoup handles it:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html 
src: bad.html
<html><body>
  <p>Testing
</p></body></html>

It isn't until you explicitly pass a DOCTYPE to tagsoup that it attempts to output one:

~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html --doctype-public=html bad.html 
src: bad.html
<!DOCTYPE  PUBLIC "html" "">
<html><body>
  <p>Testing
</p></body></html>

I hope this helps,
Zachary

Seems I've got tidy / XSLTproc working together now (see my answer below) - but TagSoup looks like a nice tool to try too. Next time something goes wrong with Tidy, I'll definitely try TagSoup. However, from the docs, you cannot change the Doctype - only the Public and System elements of it (the 2 quoted strings at the end)? — Adam, Sep 04 '11 at 20:05
@Adam: I'll be the first to admit that DOCTYPE's still mystify me, and that I "deal" with them only when I have to. That said, I might be missing something related to your question; but I think the above shows how to handle the DOCTYPE aspect of your question. The unspecified name-space is something else that trips me up from time-to-time still. — Zach Young, Sep 04 '11 at 21:25
The DOCTYPE would be: "!DOCTYPE html PUBLIC "something big goes here" "something big goes here"" ... i.e. the stuff in quotes is NOT what's used to establish the basic doctype. No? — Adam, Sep 05 '11 at 19:01

Emiliano Poggi · Answer 5 · 2011-09-04T20:17:37.347

I think the main problem is caused by the XML catalog doctype declaration. You can test this by removing the external entity reference from the input XHTML and seeing whether the processor then works correctly with it.

I would do as follows:

Use Tidy with doctype omit option.
Add the Doctype at XSLT side as described here
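The first step could be scripted along these lines. This is a minimal sketch, not a tested build step: it assumes a `tidy` binary is on the PATH, and the `tidy_command` and `tidy_file` helpers are hypothetical names. The `--numeric-entities yes` setting is an addition that is not part of the suggestion above; it is included on the assumption that named entities such as `&nbsp;` become a problem once no DTD is referenced.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line that writes XHTML without a DOCTYPE."""
    return [
        "tidy",
        "-quiet",
        "-asxhtml",                   # emit well-formed XHTML
        "--doctype", "omit",          # leave the DOCTYPE declaration out
        "--numeric-entities", "yes",  # write &#160; rather than &nbsp;
        "-output", dst,
        src,
    ]


def tidy_file(src, dst):
    """Run Tidy if it is installed; return its exit status, or None if missing."""
    if shutil.which("tidy") is None:
        return None
    # Tidy exits with status 1 when it only reported warnings,
    # so check=True is deliberately not used here.
    return subprocess.run(tidy_command(src, dst)).returncode


if __name__ == "__main__":
    # Both file names are placeholders for illustration.
    print(tidy_file("html-files/file1.htm", "html-files/file1.xhtml"))
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact flags before running them over a large batch of files.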

The main problem is that Saxon and xsltproc have no option to disable external entity resolution. The MSXSL.exe command-line utility supports this with the -xe option.

edited Sep 04 '11 at 20:17

answered Sep 04 '11 at 19:47

Emiliano Poggi

Yes - the '-xe' option sounds useful. I tried the doctype omit, and as you said it caused all external entities to go horribly wrong. How would you "add the doctype at XSLT side", though? I kept thinking: "if document() had an extra parameter that let me override the incoming DOCTYPE, that would work", but it doesn't :(. – Adam Sep 04 '11 at 20:04
I added a link describing how to handle doctype at xslt side. Check my answer now. – Emiliano Poggi Sep 04 '11 at 20:08
So ... if I understand correctly: by adding a doctype in the XSLT, and somehow getting all your entity mappings into the XSLT (write them out by hand?), you'd then be able to skip the problem where your external documents were full of unexpected entities? – Adam Sep 04 '11 at 20:12
Not really. I was simply suggesting to add DTD references at XSLT side as [follows](http://www.dpawson.co.uk/xsl/sect2/N2281.html#d3805e19), making sure (with -omit option) not to include DTD references in the input document that will be parsed by the processor. – Emiliano Poggi Sep 04 '11 at 20:17
