I have a large number of HTML files that I need to process with XSLT, using an XML file to select which HTML files to process and what to do with each of them.
I tried:
- Use HTML Tidy to convert HTML -> XHTML / XML
- Use document(filename) in XSLT to read in particular XHTML/XML files
- Use standard node-set (XPath) expressions to access e.g. "html/body/*"
This doesn't work, because:
- It seems that XSLT processors (I tried libxslt/xsltproc and Saxon) cannot process XHTML documents loaded as external files: the processor sees the XHTML DOCTYPE and refuses to expose the content as nodes.
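One thing worth ruling out before blaming the DOCTYPE: XHTML produced by Tidy normally places every element in the XHTML namespace (http://www.w3.org/1999/xhtml), and an unprefixed path such as "html/body/*" only matches elements in no namespace. The sketch below illustrates that effect with Python's standard library rather than with XSLT. The sample document and the helper name `body_children` are made up for illustration, and whether this is the actual cause here is an assumption.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample, shaped like typical Tidy XHTML output.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body><h1>Heading</h1><p>First paragraph</p></body>
</html>"""


def body_children(xhtml_text, namespace=None):
    """Return the local names of the children of <body>.

    With namespace=None the lookup uses an unprefixed name, which only
    matches elements that are in no namespace at all.
    """
    root = ET.fromstring(xhtml_text)  # the DOCTYPE itself parses without error
    if namespace is None:
        body = root.find("body")
    else:
        body = root.find("x:body", {"x": namespace})
    if body is None:
        return []
    # ElementTree stores tags as "{namespace-uri}local"; keep the local part.
    return [child.tag.split("}")[-1] for child in body]


unqualified = body_children(SAMPLE)            # finds nothing
qualified = body_children(SAMPLE, XHTML_NS)    # finds the real children
print(unqualified, qualified)
```

If the same thing is happening in the stylesheet, the XSLT equivalent would be to declare a prefix for the XHTML namespace on `xsl:stylesheet` (for example `xmlns:x="http://www.w3.org/1999/xhtml"`) and select `x:html/x:body/*`.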
Fine (I thought) ... XHTML is just XML, so I just need to put it through HTML Tidy with the settings:
"output-xml yes ... output-html no ... output-xhtml no"
...but HTML Tidy ignores those settings and forces HTML output instead :(. It seems to be hardcoded to output XML only if the input was XML to begin with.
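A possible alternative to `output-xml`, sketched under the assumption that the DOCTYPE and its DTD-defined entities are what trips up the XML parser: keep XHTML output but ask Tidy to omit the DOCTYPE (`doctype omit`) and to write entities such as `&nbsp;` as numeric references (`numeric-entities yes`), so the result can be read without the XHTML DTD. The function names and file names below are hypothetical, and the option set has not been checked against every Tidy version.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line that writes DOCTYPE-free XHTML to dst."""
    return [
        "tidy", "-q",
        "--output-xhtml", "yes",      # well-formed XHTML output
        "--doctype", "omit",          # do not emit a DOCTYPE declaration
        "--numeric-entities", "yes",  # &nbsp; becomes &#160; (no DTD needed)
        "-o", dst,
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy on one file; return True if it produced output."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy was not found on PATH")
    result = subprocess.run(tidy_command(src, dst))
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    return result.returncode in (0, 1)


command = tidy_command("page.html", "page.xhtml")
print(" ".join(command))
```

The output would still be in the XHTML namespace, so any XPath expressions run against it need namespace-qualified names.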
Any ideas for how to:
- Force HTML Tidy to obey the command-line parameters, and set the doctype I asked for
- Force xsltproc to parse documents with an XHTML DOCTYPE as XML
- ...some other cunning way that will work?
NB: this has to work on OS X, since it's part of a build process for iOS apps. That shouldn't be a big problem, but it means that e.g. Windows-only tools aren't available. I'd like to achieve this with standard open-source, cross-platform tools (like tidy, libxslt, etc.).