Effect of transport encoding on XML encoding and character references

Question

This question involves the interplay between the XML 1.0 and HTTP/1.1 specifications.

I have a web service that accepts a well-formed XML 1.0 document, parses it, and re-serializes it back to the client. The service supports both Content-Type text/xml and application/xml.

Suppose the following document is submitted as Content-Type: text/plain; charset=us-ascii with Accept: text/plain and Accept-Charset: us-ascii:

<?xml version="1.0" encoding="UTF-8" ?>
<x>Inhoffenstra&#x00DF;e</x>

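The above document is well-formed and, because it consists entirely of ASCII characters, is consistent with both the declared UTF-8 encoding and the us-ascii charset of the request.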

Once parsed, the character reference is resolved, so the XML DOM contains the character ß itself. Since the declared encoding of the document is UTF-8, the document would be re-serialized as:

<?xml version="1.0" encoding="UTF-8" ?>
<x>Inhoffenstraße</x>
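As an illustration, this round trip can be sketched with Python's standard xml.etree.ElementTree. The choice of parser is an assumption (the question does not name one), and the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# The request body from the question: pure ASCII bytes, declared as UTF-8.
request_body = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b"<x>Inhoffenstra&#x00DF;e</x>"
)

root = ET.fromstring(request_body)

# The parser resolves the character reference, so the tree holds U+00DF.
parsed_text = root.text

# Serializing as UTF-8 writes the character itself (bytes 0xC3 0x9F),
# which can no longer be decoded as US-ASCII.
utf8_bytes = ET.tostring(root, encoding="utf-8")

print(request_body.isascii())  # the request was ASCII-only
print(utf8_bytes.isascii())    # the UTF-8 response is not
```

The request passes the ASCII check while the re-serialized response does not, which is the mismatch with Accept-Charset described here.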

The above document is not compatible with the Accept-Charset header. However, there are at least three ways this request could be satisfied:

Serialize the DOM using encoding US-ASCII. This seems wrong and unnecessary because I am changing a fundamental property of the document, which may be misleading to the client (for instance, could this break something at the application layer, e.g., ESB/SOAP?):
```
<?xml version="1.0" encoding="US-ASCII" ?>
<x>Inhoffenstra&#x00DF;e</x>
```
Post-process the serialized UTF-8 in the service layer by replacing each non-ASCII character with its numeric character reference. This feels like a hack because XML-specific character escaping is being performed on the entire document using a non-XML-aware string transformation:
```
<?xml version="1.0" encoding="UTF-8" ?>
<x>Inhoffenstra&#x00DF;e</x>
```
Reject the request in the service layer as 406 Not Acceptable. This would assume that encoding="UTF-8" is in conflict with Accept-Charset: us-ascii. But I don't think this is the case, since the actual content of the request is composed entirely of ASCII characters.
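For comparison, the first two options can be sketched in Python with the standard library. This is only a sketch under assumptions: the service's actual XML stack is not stated, and the names option1 and option2 are hypothetical. ElementTree writes decimal references (&#223;) rather than the hexadecimal form shown above; both denote U+00DF.

```python
import xml.etree.ElementTree as ET

# A tree that already contains the literal character U+00DF.
root = ET.fromstring("<x>Inhoffenstra\u00dfe</x>")

# Option 1: XML-aware serialization with US-ASCII as the target encoding.
# The serializer itself emits character references for non-ASCII text.
option1 = ET.tostring(root, encoding="us-ascii")

# Option 2: serialize as UTF-8, then post-process the whole string.
# The codec error handler "xmlcharrefreplace" is not XML-aware: it would
# also rewrite non-ASCII characters in places where a character reference
# is not interpreted, such as comments or CDATA sections.
utf8_text = ET.tostring(root, encoding="utf-8").decode("utf-8")
option2 = utf8_text.encode("ascii", errors="xmlcharrefreplace")

print(option1)
print(option2)
```

For this particular document both approaches yield ASCII-only bytes with the same element content. They differ in which encoding the declaration names and in how safely they generalize to arbitrary documents.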

What is the expected, standards-compliant behavior for the response? From my understanding of the referenced standards, any of the above might be acceptable.

The following answers to a different question provide some helpful information but do not specifically address the text/xml case:

application/* Content-Type and charset attributes

I'm linking the following question because I believe it stems from a related problem:

Escaping Unicode string in XmlElement despite writing XML in UTF-8

why do you submit as `text/plain` if you say that service accepts `text/xml` and `application/xml` ? — Testo Testini, Aug 18 '18 at 14:38

score 2 · Answer 1 · edited Oct 07 '21 at 11:04

Short answer

The standards-compliant response for the presented scenario is 415 Unsupported Media Type due to the conflict between the supported media type (text/xml, application/xml) and the media type of the payload (text/plain) in the request.

Explanation

Content-Type is defined in RFC7231 Section 3.1.1.5 as follows (emphasis mine):

The "Content-Type" header field indicates the media type of the associated representation: either the representation enclosed in the message payload or the selected representation, as determined by the message semantics. The indicated media type defines both the data format and how that data is intended to be processed by a recipient, within the scope of the received message semantics, after any content codings indicated by Content-Encoding are decoded.

Because the media type of the payload is text/plain, we must process the submitted document as plain text ("how that data is intended to be processed").

So how do we process plain text? Plain text is defined in RFC2046 Section 4.1 as follows:

Plain text does not provide for or allow formatting commands, font attribute specifications, processing instructions, interpretation directives, or content markup. Plain text is seen simply as a linear sequence of characters, possibly interrupted by line breaks or page breaks.

XML defines content markup, processing instructions and other things. Parsing a plain text document as XML is a violation of the standards.

Let's take a look at your example:

<x>Inhoffenstra&#x00DF;e</x>

Converting &#x00DF; to ß is what you would do if the document is XML, but if the document is plain text, it is a violation of RFC2046 and also of RFC5147, which confirms how plain text should be processed. As plain text, &#x00DF; means &#x00DF;, and nothing else.
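The difference between the two interpretations can be shown with a small Python sketch. The use of xml.etree.ElementTree and the variable names are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

payload = "<x>Inhoffenstra&#x00DF;e</x>"

# Interpreted as XML: markup is parsed and the reference becomes U+00DF.
as_xml = ET.fromstring(payload).text

# Interpreted as plain text: a linear sequence of characters, unchanged.
as_plain_text = payload

print(as_xml)
print(as_plain_text)
```

Only the XML interpretation produces the character ß. The plain text interpretation keeps the eight characters of the reference, and the surrounding tags, exactly as sent.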

In conclusion, none of the responses you proposed is standards-compliant. The standards-compliant response for the presented scenario is 415 Unsupported Media Type.
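A minimal sketch of such a check, assuming a Python service. The function name status_for_content_type and the constant SUPPORTED_MEDIA_TYPES are hypothetical, and the header is parsed with the standard library's email.message.Message rather than any particular web framework:

```python
from email.message import Message

# Media types the service in the question says it supports.
SUPPORTED_MEDIA_TYPES = {"text/xml", "application/xml"}


def status_for_content_type(header_value: str) -> int:
    """Return 415 when the request's media type is not supported, else 200."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() drops parameters such as charset and lowercases
    # the type/subtype pair.
    media_type = msg.get_content_type()
    return 200 if media_type in SUPPORTED_MEDIA_TYPES else 415


print(status_for_content_type("text/plain; charset=us-ascii"))
print(status_for_content_type("text/xml; charset=utf-8"))
```

With this check in place, the text/plain request from the question is rejected before any XML parsing or charset negotiation takes place.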
