Why must < be escaped in an XML attribute?

Question

I wonder a bit why < must be escaped in an XML attribute, e.g.

<foo bar="3 < 4" />

From the surrounding (inside a tag, inside an attribute value) it should be quite clear for a parser that it can't be the beginning of a new tag.

What is the reason the XML specification prohibits this?

kjhughes · Accepted Answer · 2021-11-06T21:41:58.020

A less than character (<) must indeed be escaped within attribute values:

Well-Formedness Constraint: No < in Attribute Values

The replacement text of any entity referred to directly or indirectly in an attribute value (other than "&lt;") must not contain a <.
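As a rough illustration of the constraint (not part of the original answer), the sketch below feeds three small documents to Python's standard `xml.etree.ElementTree` parser. The helper name `parses` and the variable names are made up for this example, and the behavior shown is that of one conforming parser rather than a statement about every XML tool.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the parser accepts doc as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# The example from the question: a raw < inside an attribute value.
raw_ok = parses('<foo bar="3 < 4" />')

# The same attribute with the < written as the predefined entity &lt;.
escaped_value = ET.fromstring('<foo bar="3 &lt; 4" />').get("bar")

# A raw > in an attribute value is not covered by this constraint.
gt_ok = parses('<foo bar="4 > 3" />')

print(raw_ok, repr(escaped_value), gt_ok)
```

The escaped form round-trips to the literal text `3 < 4`, so the restriction affects only how the character is written in the source, not what value the application sees.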

Why?

As you observe, attribute values containing < can be unambiguously parsed. However, the motivation was to make XML's parsing rules as simple as possible...

According to Tim Bray, one of the XML 1.0 W3C Recommendation editors and author of The Annotated XML Specification, which captures some of the rationale behind XML design decisions:

Banishing the <

This rule might seem a bit unnecessary, on the face of it. Since you can't have tags in attribute values, having an < can hardly be confusing, so why ban it?

This is another attempt to make life easy for the DPH. The rule in XML is simple: when you're reading text, and you hit a <, then that's a markup delimiter. Not just sometimes, always. When you want one in the data, you have to use &lt;. Not just sometimes, always. In attribute values too.

This rule has another unintended beneficial side-effect; it makes the catching of certain errors much easier. Suppose you have a chunk of XML as follows:

<a href="notes.html> <img src='notes.gif'></a>

Notice that the notes.html is missing its closing quote. Without the no-< rule, it would be really hard to detect this problem and issue a reasonable error message. Since attribute values can contain almost anything, no error would be detected until the processor finds the next quotation mark. Instead, you get an error message the first time you hit a <, which in the example above, as in many cases, is almost immediately.
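To see the error-detection point in practice, here is a minimal sketch (an illustration added here, not from the annotated spec) that runs the broken snippet above through Python's `xml.etree.ElementTree`. The variable names are hypothetical, and the exact message and column reported will depend on the parser.

```python
import xml.etree.ElementTree as ET

# The snippet from the quote: href is missing its closing double quote.
broken = "<a href=\"notes.html> <img src='notes.gif'></a>"

error = None
try:
    ET.fromstring(broken)
except ET.ParseError as exc:
    # ParseError carries a (line, column) position for the failure.
    error = exc

if error is not None:
    line, column = error.position
    print(f"rejected at line {line}, column {column}: {error}")
```

The parser stops inside the unterminated attribute value instead of scanning on in search of a closing quote, which is the behavior the quoted passage describes.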


Tim Bray's rationale rather overlooks the fact that `<` is allowed in the content of comments and processing instructions... — Michael Kay, Nov 06 '21 at 22:16
@MichaelKay ... and also that `>` *is* permitted, which also complicates the life of the DPH. But a not very good rationale is still a rationale, and this quote seems to provide an objective answer to the question "What is the reason", as opposed to the more subjective question "Should `<` be excluded?", which I don't think can be answered within SO's terms of reference. — rici, Nov 07 '21 at 01:05

score -1 · Answer 2 · answered Nov 06 '21 at 18:32


I don't know precisely, but in many cases the explanation is SGML-compatibility. XML was designed to be a subset of SGML, and therefore didn't allow things that SGML didn't allow.


Michael Kay


But SGML allows arbitrary characters (other than the terminating quote) in attribute values of type CDATA. PCDATA is not one of the attribute formats. (And don't ask why I still have a copy of the SGML handbook on my bookshelf.) – rici Nov 06 '21 at 23:10
