11

Does XHTML5 support character entities such as &nbsp; and &mdash;? At work we can require specific software to access the admin side of the site, and people are demanding multi-file upload. For me this is an easy justification to require migrating to FF 3.6+, so I'll be doing it soonish. We currently use XHTML 1.1, and on moving to HTML5 the only issues I'm having are with character entity names. Does anyone have a doc on this?

I see there is a list in the WHATWG spec, but I'm not sure whether it applies to files served as application/xhtml+xml. In any case, the two entities mentioned trigger errors in both Chromium nightly and FF 3.6.

Evan Carroll

5 Answers

13

There is no DTD for XHTML5, so an XML parser will see no entity definitions (other than the predefined ones). If you wanted to use an entity you would have to define it for yourself in the internal subset.

<!DOCTYPE html [
    <!ENTITY mdash "—">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
    ... &mdash; ...
</html>

(Of course using the internal subset is likely to trip browsers up if you serve it to them as text/html. Sending an internal subset in a non-XHTML HTML5 document is disallowed.)
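The behaviour described above can be checked with any non-validating XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name parse_text is made up for illustration, and browsers may report the error differently from this parser.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_text(document: str) -> str:
    """Parse an XML document and return the text of its root element."""
    return ET.fromstring(document).text


# The five predefined XML entities need no declaration.
predefined = parse_text(
    f'<html xmlns="{XHTML_NS}">&amp; &lt; &gt; &quot; &apos;</html>'
)

# A named HTML entity with no declaration is a well-formedness error.
try:
    parse_text(f'<html xmlns="{XHTML_NS}">a &mdash; b</html>')
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in the internal subset makes the same reference resolve.
declared = parse_text(
    '<!DOCTYPE html [<!ENTITY mdash "&#x2014;">]>'
    f'<html xmlns="{XHTML_NS}">a &mdash; b</html>'
)

print(repr(predefined))
print(undeclared_ok)
print(repr(declared))
```

The second case is the situation in the question: without a declaration, &mdash; is simply unknown to an XML parser.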

The HTML5 wiki currently recommends:

Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;)

And I agree with this advice not just for XHTML5 but for XML and HTML in general. There's little reason to be using the HTML entities for anything today. Unicode characters typed directly are far more readable for everyone, and &#...; character references are available for those sad cases when you can't guarantee an 8-bit/encoding-clean transport. (Since HTML entities are not defined for the majority of Unicode characters, you are going to need those anyway.)
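For the case where the transport is not encoding-clean, numeric character references can be generated mechanically. This is a small sketch using Python's built-in xmlcharrefreplace error handler; the function name to_ascii_xml is only an illustration, and it does not escape markup characters such as & or <, which would still need separate handling.

```python
def to_ascii_xml(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# e-acute, em dash, i-diaeresis and a non-breaking space, written as escapes
sample = "caf\u00e9 \u2014 na\u00efve\u00a0text"
converted = to_ascii_xml(sample)
print(converted)
```

The result contains only ASCII, so it survives any transport, and an XML or HTML parser turns the references back into the original characters.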

bobince
  • 5
    If you want readability, just type a ‘—’ character. There's no point trying to learn all the HTML entity names. Use the real character; paste it from the character map if you have to, but there are easier ways to input these characters if you do it a lot. (On my keyboard, shift-alt-minus produces it, for example.) – bobince Jul 09 '10 at 18:02
  • 1
    I upvoted that comment, because it is true, but what about "&nbsp;"? How is that less readable than ` `? – Evan Carroll Jul 09 '10 at 18:05
  • 1
    It would seem better if they would just formalize [these](http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references) into the internal HTML5 DTD, rather than leave it empty. – Evan Carroll Jul 09 '10 at 18:06
  • There is no HTML5 DTD, empty or otherwise, XML-based or otherwise! WHATWG took the position that DTD was an outmoded and insufficient schema language to describe HTML5. (And it is, it's bloody awful. The XML version is a bit more sane than the horrific SGML original, but still plenty nasty.) So HTML5 defines a new, non-SGML serialisation for plain-HTML that has many predefined entities. But for the XML serialisation XHTML5, no such strategy is possible as the only way to have an entity in XML is with a DTD (internal or external). – bobince Jul 09 '10 at 18:35
  • 2
    Which is why most XML users today never use entity references. Here's a more readable non-breaking space for you: ‘ ’. (Shift-space on my keyboard, FWIW!) – bobince Jul 09 '10 at 18:36
  • 5
    Right, unfortunately it looks no different to the eye reading the source. – Evan Carroll Jul 09 '10 at 19:22
  • 1
    @bobince: RE: *"There is no DTD for XHTML5."* I believe your answer may need a refresh. In the W3C HTML5 Recommendation, section [9.2 Parsing XHTML documents](http://www.w3.org/TR/html5/the-xhtml-syntax.html#parsing-xhtml-documents) states: "This specification provides the following additional information that user agents should use when retrieving an external entity:…(This URL is a **DTD** containing the **entity declarations** for the names listed in the **named character references** section.)" – DavidRR May 05 '15 at 13:53
6

I needed to validate, as XML, content that was potentially HTML 5. HTML 4 and XHTML defined only a mediocre 250 or so entities, while the current draft (January 2012) has more than 2000.

GET 'http://www.w3.org/TR/html5-author/named-character-references.html' |
xmllint --html --xmlout --format --noent - | 
egrep '<code|<span.*glyph' |  # get only the bits we're interested in
sed -e 's/.*">/__/' | # Add some "__" markers to make e.g. whitespace
sed -e 's/<.*/__/' |  #  entities work with xargs
sed 's/"/\&quot;/' | # xmllint output contains " which messes up xargs
sed "s/'/\&apos;/" | # ditto apostrophes. Make them HTML entities instead.
xargs -n 2 echo |  # Put the entity names and values on one line
sed 's/__/<!ENTITY /' | # Make a DTD
sed 's/;__/ /' |
sed 's/ __/"/'  |
sed 's/__$/">/' |
egrep -v '\bapos\b|\bquot\b|\blt\b|\bgt\b|\bamp\b' # remove XML entities.

You end up with a file containing 2114 entities.

<!ENTITY AElig "&#xC6;">
<!ENTITY Aacute "&#xC1;">
<!ENTITY Abreve "&#x102;">
<!ENTITY Acirc "&#xC2;">
<!ENTITY Acy "&#x410;">
<!ENTITY Afr "&#x1D504;">

Plugging this into an XML parser should allow the XML parser to resolve these character entities.
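One way to plug the generated declarations into a parser is to wrap them in an internal subset in front of the content. The sketch below assumes Python's standard xml.etree.ElementTree and uses a few of the declarations listed above; the helper name parse_fragment and the wrapper element name root are illustrative choices, not part of any specification.

```python
import xml.etree.ElementTree as ET

# A few declarations copied from the generated output above.
ENTITY_DECLARATIONS = """\
<!ENTITY AElig "&#xC6;">
<!ENTITY Aacute "&#xC1;">
<!ENTITY Acy "&#x410;">
<!ENTITY Afr "&#x1D504;">
"""


def parse_fragment(fragment: str, declarations: str, root: str = "root") -> ET.Element:
    """Wrap a well-formed fragment in a root element preceded by an internal subset."""
    doctype = f"<!DOCTYPE {root} [\n{declarations}]>"
    return ET.fromstring(f"{doctype}<{root}>{fragment}</{root}>")


element = parse_fragment("<p>&AElig; &Acy; &Afr;</p>", ENTITY_DECLARATIONS)
text = element.find("p").text
print([f"U+{ord(ch):04X}" for ch in text if ch != " "])
```

With the full generated file in place of ENTITY_DECLARATIONS, the same wrapper should resolve any of the declared names.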

Update October 2012: Since the working draft now has a JSON file (yes, I'm still using regular expressions) I worked it down to a single sed:

curl -s 'http://www.w3.org/TR/html5-author/entities.json' |
sed -n '/^  "&/s/"&\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/<!ENTITY \1 "\&#\2;">/p' |
uniq

Of course a JavaScript equivalent would be a lot more robust, but not everyone has node installed. Everyone has sed, right? Random sample output:

<!ENTITY subsetneqq "&#10955;">
<!ENTITY subsim "&#10951;">
<!ENTITY subsub "&#10965;">
<!ENTITY subsup "&#10963;">
<!ENTITY succapprox "&#10936;">
<!ENTITY succ "&#8827;">
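As a rough illustration of the more robust route, here is a sketch that reads the JSON with a real JSON parser instead of a regular expression. SAMPLE_JSON is a tiny hand-written stand-in in the same shape as entities.json rather than the real file, and the function name entity_declarations is made up. Unlike the sed one-liner, it also emits entities that map to more than one code point.

```python
import json

# Tiny stand-in for entities.json (assumed shape: name -> codepoints/characters).
SAMPLE_JSON = r"""
{
  "&AElig": { "codepoints": [198], "characters": "\u00C6" },
  "&AElig;": { "codepoints": [198], "characters": "\u00C6" },
  "&amp;": { "codepoints": [38], "characters": "&" },
  "&succ;": { "codepoints": [8827], "characters": "\u227B" },
  "&NotEqualTilde;": { "codepoints": [8770, 824], "characters": "\u2242\u0338" }
}
"""

# The five entities every XML parser already knows; they must not be redeclared here.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def entity_declarations(entities: dict) -> list:
    """Turn the JSON mapping into a list of <!ENTITY ...> declaration lines."""
    lines = []
    for reference, info in entities.items():
        if not reference.endswith(";"):
            continue  # legacy spelling without ';' duplicates another key
        name = reference[1:-1]  # strip the leading '&' and trailing ';'
        if name in PREDEFINED:
            continue
        value = "".join(f"&#{cp};" for cp in info["codepoints"])
        lines.append(f'<!ENTITY {name} "{value}">')
    return lines


declarations = entity_declarations(json.loads(SAMPLE_JSON))
print("\n".join(declarations))
```

One caveat applies to both versions: names such as LT or AMP, whose value is a markup character, would need a double-escaped value (for example "&#38;#60;") before they could be referenced safely.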
mogsie
2

My best advice is to not upgrade to HTML5 or XHTML5 until support for character entity names is provided.

Anyone who thinks that &#12345; makes more sense than &mdash; needs a brain upgrade. Most people can't remember huge tables of numbers.

Those of us who have to remain with older operating systems to be compatible with existing scientific, real-time, or point-of-sale hardware (or government networks) can't just type the character or pick it from a list. It won't save correctly in the file.

The reason this has been imposed on us is that the W3C no longer wants the expense of serving DTD files, so we must go back to the stone age.

A feature like this, once provided, should never be deprecated.

midimagic
2

The right answer (the modern way)

I asked this question five years ago. Now every browser supports UTF-8, and UTF-8 can encode every character that has a named character entity. The most correct current solution to this problem is not to use named entities at all, but to serve only (strict) UTF-8 and to use the actual characters in it.

This is a list of all XML entities. All of these have UTF-8 character alternatives -- and that's how they'd normally be rendered anyway.

For instance, take

U+1D6D8, MATHEMATICAL BOLD SMALL CHI            , b.chi

I suppose in some variant of XML you could write &b.chi; or something similar. Searching for MATHEMATICAL BOLD SMALL CHI turns up a page on fileformat.info that lists the character.

Alternatively, in Windows you can type Alt + 1 D 6 D 8 (the 1D6D8 comes from the table of XML entities), or in Linux Ctrl + Shift + u 1 D 6 D 8.

This will put the character into your document the right way.
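If typing the character is awkward, it can also be produced from its Unicode name. This is a small sketch using Python's standard unicodedata module, shown only to make the relationship between the name, the code point, the UTF-8 bytes and the numeric reference concrete.

```python
import unicodedata

# Look the character up by the Unicode name used in the entity table.
ch = unicodedata.lookup("MATHEMATICAL BOLD SMALL CHI")

code_point = f"U+{ord(ch):04X}"    # the code point as listed in the table
utf8_bytes = ch.encode("utf-8")    # what actually goes into a UTF-8 file
numeric_ref = f"&#x{ord(ch):X};"   # fallback for a non-UTF-8 transport

print(code_point, utf8_bytes.hex(), numeric_ref)
```

Writing ch straight into a file saved as UTF-8 gives the same result as typing the character by hand.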

Evan Carroll
0

Using the following answer: https://stackoverflow.com/a/9003931/689044 , I created the file and posted it as a Gist on GitHub: https://gist.github.com/cerkit/c2814d677854308cef57 for those of you who need the Entities in a file.

I used it successfully with ASP.NET MVC by loading the text file into the Application object and using that value with my (well-formed) HTML to parse a System.Xml.XmlDocument.

XmlDocument doc = new XmlDocument();

// load the HTML entities into the document and add a root element so it will load
// The HTML entities are required or it won't load the document if it uses any entities (ex: &ndash;)
doc.LoadXml(string.Format("{0}<root>{1}</root>", Globals.HTML_ENTITIES, control.HtmlText));
var childNodes = doc.SelectSingleNode("//root").ChildNodes;
// do your work here    
foreach(XmlNode node in childNodes)
{
    // or here
}

Globals.HTML_ENTITIES is a static property that loads the entities from the text file and stores them in the Application object, or it uses the values if they're already loaded in the Application object.

using System.IO;
using System.Web;

public static class Globals
{
    public static readonly string APPLICATION_KEY_HTML_ENTITIES = "HTML_ENTITIES";

    public static string HTML_ENTITIES
    {
        get
        {
            string retVal = null;
            // reuse the entities if they're already cached in the Application object
            if (HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES] != null)
            {
                retVal = HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES].ToString();
            }
            else
            {
                // otherwise load the HTML entities from the text file and cache them
                using (StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath("~/Content/HtmlEntities/RootHtmlEntities.txt")))
                {
                    retVal = sr.ReadToEnd();
                    HttpContext.Current.Application[APPLICATION_KEY_HTML_ENTITIES] = retVal;
                }
            }

            return retVal;
        }
    }
}

I tried creating a long string to hold the values, but it kept crashing Visual Studio, so I decided that the best route would be to load the text file at runtime and store it in the Application object.

Michael Earls