
I'm trying to clean up some HTML files. I have converted them to XHTML with tidy:

$ tidy -asxml -i -w 150 -o o.xml index.html

The resulting XHTML contains named entities. When I run xsltproc on those files, I keep getting errors:

$ xsltproc --novalid  -o out.htm  t.xsl o.xml
o.xml:873: parser error : Entity 'mdash' not defined
            resources to storing data and using permissions &mdash; as needed.</
                                                                   ^
o.xml:914: parser error : Entity 'uarr' not defined
        </div><a href="index.html#top" style="float:right">&uarr; Go to top</a>
                                                                 ^
o.xml:924: parser error : Entity 'nbsp' not defined
          Android 3.2&nbsp;r1 - 27 Jul 2011 12:18

If I add --html to the xsltproc command, it instead complains about a tag whose name and id attributes have the same value (which is valid):

$ xsltproc --novalid --html -o out.htm  t.xsl o.xml
o.xml:845: element a: validity error : ID top already defined
      <a name="top" id="top"></a>
                            ^

The xslt is simple:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes" omit-xml-declaration="yes"/>

    <xsl:template match="node()|@*">
      <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
    </xsl:template>

    <xsl:template match="*[@id='side-nav']"/>
</xsl:stylesheet>

Why doesn't --html work? Why is it complaining? Or should I forget it and fix the entities?

vangop

2 Answers


I went the other way: I made tidy produce numeric entities rather than named ones with the -n option.

$ tidy -asxml -i  -n -w 150 -o o.xml index.xml

Now I can drop the --html option and it works. I could also remove that name attribute, but I still wonder why it is reported as an error when it is valid.
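
For anyone who wants to see the difference in isolation, here is a minimal sketch. The file names named.xml and numeric.xml are made up for illustration, and the check assumes xmllint (the command-line tool from libxml2, the parser underneath xsltproc) is installed; it is skipped otherwise.

```shell
# Two tiny test documents (hypothetical names): same text, different entity style.
cat > named.xml <<'EOF'
<p>using permissions &mdash; as needed.</p>
EOF
cat > numeric.xml <<'EOF'
<p>using permissions &#8212; as needed.</p>
EOF

if command -v xmllint >/dev/null 2>&1; then
  # A named entity must be declared in a DTD, so this one should be rejected.
  xmllint --noout named.xml || echo "named.xml: not well-formed"
  # A numeric character reference needs no declaration at all.
  xmllint --noout numeric.xml && echo "numeric.xml: well-formed"
else
  echo "xmllint not found, skipping the check"
fi
```

The numeric form works without any DTD, which would explain why the -n output gets through xsltproc with --novalid and no --html.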

vangop
  • It is not valid. From the page that you linked to: "The `id` and `name` attributes share the same name space. This means that they cannot both define an anchor with the same name in the same document". – mzjn Jul 31 '11 at 18:35
  • No, read further, "The following example illustrates that id and name must be the same when both appear in an element's start tag:.." – vangop Jul 31 '11 at 18:39
  • Isn't this about XHTML (which is XML)? xsltproc is an XML tool and it is just applying XML rules which state that there can be only one attribute of type `ID` per element. See http://www.w3.org/TR/xhtml1/#h-4.10. – mzjn Jul 31 '11 at 18:56

I am assuming that the question is this: I know how to avoid "Entity 'XXX' not defined" errors when running xsltproc (add --html), but how do I get rid of "ID YYY already defined"?

Recent builds of Tidy have an anchor-as-name option. You can set it to "no" to remove unwanted name attributes:

This option controls the deletion or addition of the name attribute in elements where it can serve as anchor. If set to "yes", a name attribute, if not already existing, is added along an existing id attribute if the DTD allows it. If set to "no", any existing name attribute is removed if an id attribute exists or has been added.
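
A rough sketch of how that might look on the command line, assuming a Tidy build new enough to know the option. The file names anchor.html and anchor.xml are made up for illustration, and -n is added on the assumption that you also want numeric entities so that --html is not needed.

```shell
# Small sample page (hypothetical) with the same name/id pair as in the question.
cat > anchor.html <<'EOF'
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>t</title></head>
<body><a name="top" id="top"></a><p>x &mdash; y</p></body></html>
EOF

if command -v tidy >/dev/null 2>&1; then
  # -asxml: XHTML output; -n: numeric entities;
  # --anchor-as-name no: drop name where an id is present.
  # Tidy exits with 1 when it only has warnings, so do not treat that as fatal.
  tidy -q -asxml -n --anchor-as-name no -o anchor.xml anchor.html \
    || echo "tidy exit status: $?"
  # Show what is left of the anchor element, if any output was written.
  if [ -f anchor.xml ]; then grep -n '<a ' anchor.xml || true; fi
else
  echo "tidy not found, skipping"
fi
```

If your build rejects the option, it is probably too old, and removing the name attribute by hand or in the stylesheet is the fallback.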

mzjn
  • Do I really need the --html option? What does it do? I couldn't find any details on it. – vangop Jul 31 '11 at 18:43
  • I don't really know more about the `--html` switch than [this](http://xmlsoft.org/XSLT/xsltproc2.html). I suppose that it should be used when working with HTML that is not well-formed and that might contain references to entities that are predefined in (X)HTML but not in XML. – mzjn Jul 31 '11 at 19:53