I needed to run XML validation on documents that might be HTML5. HTML 4 and XHTML defined a mere 250 or so named entities, while the current draft (January 2012) has more than 2000.
GET 'http://www.w3.org/TR/html5-author/named-character-references.html' |
xmllint --html --xmlout --format --noent - |
egrep '<code|<span.*glyph' | # get only the bits we're interested in
sed -e 's/.*">/__/' | # Add some "__" markers to make e.g. whitespace
sed -e 's/<.*/__/' | # entities work with xargs
sed 's/"/\&quot;/' | # xmllint output contains " which messes up xargs
sed "s/'/\&apos;/" | # ditto apostrophes. Make them HTML entities instead.
xargs -n 2 echo | # Put the entity names and values on one line
sed 's/__/<!ENTITY /' | # Make a DTD
sed 's/;__/ /' |
sed 's/ __/"/' |
sed 's/__$/">/' |
egrep -v '\bapos\b|\bquot\b|\blt\b|\bgt\b|\bamp\b' # remove XML's five predefined entities.
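To see what the `__` markers buy you, here is the marker-and-reassembly portion of the pipeline run on its own. The two input lines are hypothetical, reconstructed to match the shapes the egrep stage above selects (a <code> line carrying the entity name, a glyph line carrying the character):

```shell
# Hypothetical input: one <code> line (entity name) and one glyph line.
printf '%s\n' '<code title="">AElig;</code>' '<span class="glyph">Æ</span>' |
sed -e 's/.*">/__/' |  # mark the front: __AElig;</code> and __Æ</span>
sed -e 's/<.*/__/' |   # mark the back:  __AElig;__     and __Æ__
xargs -n 2 echo |      # pair them up:   __AElig;__ __Æ__
sed 's/__/<!ENTITY /' |
sed 's/;__/ /' |
sed 's/ __/"/' |
sed 's/__$/">/'
# prints: <!ENTITY AElig "Æ">
```

The markers survive xargs because they contain no whitespace or quote characters, so each marked fragment stays a single token.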
You end up with a file containing 2114 entity declarations:
<!ENTITY AElig "Æ">
<!ENTITY Aacute "Á">
<!ENTITY Abreve "Ă">
<!ENTITY Acirc "Â">
<!ENTITY Acy "А">
<!ENTITY Afr "𝔄">
Plugging this DTD into an XML parser should allow it to resolve these character entities.
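As a quick smoke test, using xmllint as above: declare one of the generated entities in a document's internal subset (a hypothetical one-entity sample; in practice you would reference the whole generated file via something like <!DOCTYPE html SYSTEM "entities.dtd">) and let --noent substitute the reference:

```shell
# Hypothetical one-entity check; --noent substitutes entity references
# with their values, so &AElig; should disappear from the output.
printf '%s\n' '<!DOCTYPE p [ <!ENTITY AElig "&#198;"> ]><p>&AElig;</p>' |
xmllint --noent -
```

The output contains the Æ character (or an equivalent numeric character reference, depending on the serializer's output encoding) in place of the entity reference.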
Update, October 2012: since the working draft now provides a JSON file, I worked it down to a single sed command (yes, I'm still using regular expressions):
curl -s 'http://www.w3.org/TR/html5-author/entities.json' |
sed -n '/^ "&/s/"&\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/<!ENTITY \1 "\&#\2;">/p' |
uniq
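To see the transformation, here is that sed applied to a single line shaped like the entries in entities.json (the sample line is reconstructed, not copied from the file): the first capture grabs the entity name after "&, the second grabs the first codepoint inside the brackets.

```shell
# Hypothetical sample line in the shape of entities.json entries.
printf '%s\n' ' "&AElig;": { "codepoints": [198], "characters": "\u00C6" },' |
sed -n '/^ "&/s/"&\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/<!ENTITY \1 "\&#\2;">/p'
```

Since the file lists each entity both with and without a trailing semicolon, both variants reduce to the same declaration, which is what the uniq is for.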
Of course a JavaScript equivalent would be a lot more robust, but not everyone has Node installed. Everyone has sed, right? Random sample output:
<!ENTITY subsetneqq "⫋">
<!ENTITY subsim "⫇">
<!ENTITY subsub "⫕">
<!ENTITY subsup "⫓">
<!ENTITY succapprox "⪸">
<!ENTITY succ "≻">