A less-than character (<) must indeed be escaped within attribute values:
Well-Formedness Constraint: No < in Attribute Values
The replacement text of any entity referred to directly or indirectly
in an attribute value (other than "&lt;") must not contain a <.
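To see the constraint in practice, here is a minimal sketch using Python's standard-library parser. The element name `rule`, the attribute name `test`, and the helper `is_well_formed` are made up for illustration; any conforming XML parser should behave the same way on these two inputs.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A literal < inside an attribute value violates the constraint.
raw_ok = is_well_formed('<rule test="a < b"/>')

# The escaped form is accepted.
escaped_ok = is_well_formed('<rule test="a &lt; b"/>')

# After parsing, the escaped form yields the original character.
value = ET.fromstring('<rule test="a &lt; b"/>').get("test")

# When writing XML by hand, quoteattr escapes the value and adds the quotes.
quoted = quoteattr("a < b")

print(raw_ok, escaped_ok, value, quoted)
```

Serializers such as `ElementTree.tostring` do this escaping for you; `quoteattr` is only needed if you assemble the markup as strings yourself.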
Why?
As you observe, attribute values containing < can be unambiguously parsed. However, the motivation was to make XML's parsing rules as simple as possible...
According to Tim Bray, one of the XML 1.0 W3C Recommendation editors and author of The Annotated XML Specification, which captures some of the rationale behind XML design decisions:
Banishing the <
This rule might seem a bit unnecessary, on the face
of it. Since you can't have tags in attribute values, having an < can
hardly be confusing, so why ban it?
This is another attempt to make life easy for the DPH. The rule in XML
is simple: when you're reading text, and you hit a <, then that's a
markup delimiter. Not just sometimes, always. When you want one in the
data, you have to use &lt;. Not just sometimes, always. In attribute
values too.
This rule has another unintended beneficial side-effect; it makes the
catching of certain errors much easier. Suppose you have a chunk of
XML as follows:
<a href="notes.html> <img src='notes.gif'></a>
Notice that the notes.html is missing its closing quote. Without the
no-< rule, it would be really hard to detect this problem and
issue a reasonable error message. Since attribute values can contain
almost anything, no error would be detected until the processor finds
the next quotation mark. Instead, you get an error message the first
time you hit a <, which in the example above, as in many cases, is
almost immediately.
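As a rough check of that claim, the fragment from the quote can be fed to a real parser. This is only a sketch with Python's standard library; the variable names are mine, and the exact error message and column depend on the underlying parser, so none are shown here.

```python
import xml.etree.ElementTree as ET

# The fragment from the quote: the href value never gets its closing quote.
broken = """<a href="notes.html> <img src='notes.gif'></a>"""

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    error = exc

# ParseError.position is a (line, column) tuple for where parsing stopped.
if error is not None:
    line, column = error.position
    print(f"rejected at line {line}, column {column}: {error}")
```

Note that the fragment would still be rejected with the quote fixed, because the img element is never closed, so this shows that the error is reported and where, not that the missing quote is the only problem.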