
One email provider rejected an email containing special characters (e.g. umlauts). They say that they are RFC-5321 and RFC-5322 compliant. I browsed those standards, but they do not support international emails (thus no umlauts); only 7-bit ASCII is supported. There is an extension, RFC-6532, which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:

"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller@foo.org>

Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2054)? After all there are so many mail related RFCs that I might have missed 10 or 20 ;-)

Lonzak

1 Answer

It's on the right track, but it's wrong.

"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller@foo.org>

There are 2 problems with the above form:

  1. The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. Standards-compliant mail software that parses this address won't decode that token, because RFC2047 does not allow an encoded-word inside a quoted-string.
  2. The "name" is butted up against the angle brackets and should not be. There MUST be a space between the encoded-word and the < in order to be standards compliant.

In other words, this is what it should look like:

=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>
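As a rough illustration (not part of the original answer), Python's standard library can generate an address header of this shape. The sketch below assumes the display name and address from the question. `email.utils.formataddr` chooses between the B and Q encodings itself, so the exact token may differ from the one shown above, but it is unquoted and followed by a space.

```python
from email.header import decode_header, make_header
from email.utils import formataddr

# Display name and address taken from the question.
header_value = formataddr(("Börge Möller", "boerge.moeller@foo.org"))
print(header_value)

# Round trip: split off the encoded display name and decode it again.
encoded_name, _, _ = header_value.partition(" <")
decoded_name = str(make_header(decode_header(encoded_name)))
print(decoded_name)
```

Letting a library build the header avoids both problems listed above, since it never wraps the encoded-word in quotes and always inserts the separating space.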

The RFCs that you need to look at are:

  • RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
  • RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
  • RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.

It may also be helpful to read RFC2822 which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't have a lot of value. The only reason RFC822 still has value is because of its much older grammar definitions that are referenced by RFC2047 (such as atom, dot-atom, phrase, angle-addr, addr-spec, tspecials, etc).
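To make the RFC2047 encoded-word layout (`=?charset?encoding?encoded-text?=`) concrete, here is a minimal hand-rolled decoder. It is only a sketch: the function name `decode_encoded_word` and the simplified regular expression are assumptions made for illustration, and it handles a single token without language suffixes. Real code should prefer `email.header.decode_header`.

```python
import base64
import quopri
import re

# Simplified pattern for one encoded-word: =?charset?encoding?encoded-text?=
# (Language suffixes such as "*en" are deliberately not handled here.)
ENCODED_WORD = re.compile(r"=\?([^?*\s]+)\?([BbQq])\?([^?\s]*)\?=")

def decode_encoded_word(token: str) -> str:
    """Decode a single RFC2047 encoded-word (hypothetical helper)."""
    match = ENCODED_WORD.fullmatch(token)
    if match is None:
        raise ValueError(f"not an encoded-word: {token!r}")
    charset, encoding, text = match.groups()
    if encoding.upper() == "B":
        # "B" means base64.
        raw = base64.b64decode(text)
    else:
        # "Q" is quoted-printable-like; header=True maps "_" to a space.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    return raw.decode(charset)

print(decode_encoded_word("=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="))  # Börge Möller
```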

RFC6532 is even newer than RFC5322. Its purpose is to remove the need to encode headers altogether by allowing UTF-8 to be used in them directly.
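The difference between the two styles can be seen with Python's `email` package, which ships an `email.policy.SMTP` policy (ASCII-only headers, non-ASCII text becomes RFC2047 encoded-words) and an `email.policy.SMTPUTF8` policy (raw UTF-8 in headers, in the spirit of RFC6532). This is a sketch; the helper name `render_from` is made up for the example.

```python
from email import policy
from email.headerregistry import Address
from email.message import EmailMessage

def render_from(pol) -> bytes:
    """Serialize a message that only has a From header (hypothetical helper)."""
    msg = EmailMessage(policy=pol)
    msg["From"] = Address("Börge Möller", "boerge.moeller", "foo.org")
    return msg.as_bytes()

ascii_style = render_from(policy.SMTP)      # RFC5322 + RFC2047 style
utf8_style = render_from(policy.SMTPUTF8)   # RFC6532 style

print(ascii_style.decode("ascii"))
print(utf8_style.decode("utf-8"))
```

The first rendering is pure ASCII and is what a server that only implements RFC5322 expects; the second contains the name as raw UTF-8 bytes.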

Before RFC6532, there was no standard for the character encoding to use for headers other than ASCII (which was what RFC822 used) and so everything was always supposed to conform to ASCII.

A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, all depending on what region the user(s) were in and what character encoding(s) were in wide use in those regions (e.g. Big5 and GB2312 are popular in various parts of China, Shift-JIS is popular in Japan, EUC-KR/KS-C-5601-1987 are popular in Korea, etc).

This caused major interoperability problems, though, not least because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbledygook.

UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.

Lonzak
jstedfast
  • Thank you for your clarification. But is the "corrected" form `=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>` RFC-6532 compliant? – Lonzak Nov 05 '21 at 15:46
  • The way specifications work is that if you are compliant with the older spec, you are compliant with the newer spec. In other words, if your message conforms to RFC5322, it *also* complies with RFC6532. That said, if you encode the name of the sender/recipient at all, it is not *implementing* RFC6532. Does that make sense? – jstedfast Nov 05 '21 at 15:48
  • If a client sends out a message *implementing* RFC6532, it would look like this: `From: Börge Möller <boerge.moeller@foo.org>` but said message would not be accepted by software that only implements RFC5322. – jstedfast Nov 05 '21 at 15:51
  • Oh is that confusing :-) I thought RFC5322 is the latest standard regarding emails, however it is not defining aspects of encoding/international mails. This is defined in RFC-6532 which might be newer but has a completely separate topic so to speak. I understand the example in your last comment and that our example address is not RFC-6532. But then in which RFC is this UTF-8 encoding syntax defined? Or is there none? – Lonzak Nov 05 '21 at 16:38
  • You sound very confused :-) RFC5322 is the standard you should focus on complying with. Not RFC6532. RFC6532 is the standard that says "ok guys, it's acceptable to not encode your email headers anymore as long as you use UTF-8", essentially. – jstedfast Nov 05 '21 at 16:40
  • So just to clarify once more, RFC5322 is perfectly able to accommodate arbitrary character sets, but you have to encode them into an ASCII representation (using a MIME encoding such as RFC2047 for headers, as illustrated in this answer). The promise of 6532 is that it allows you to use bare UTF-8 in the headers; but few systems implement this, still. – tripleee Nov 05 '21 at 16:40
  • Thank you two for the clarification. I now understand the intent of RFC-6532. One question remains though: in the example email everything is ASCII, but the part `=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?=` is encoded. Where is it defined how a mail server that receives this special syntax knows what to do with it? I mean, it is encoded and not like `Boerge Moeller <...>`. Or is this also part of RFC-5322? – Lonzak Nov 05 '21 at 17:20
  • Mail *servers* typically don't need to implement RFC2047 decoding because they don't care about the human readable portions of the headers. They only care about the addresses which means they really only need to parse the address headers according to the RFC5322 specifications. Mail *clients* implement RFC2047 decoding because they want to be able to display the human readable name tied to the email address. – jstedfast Nov 05 '21 at 17:21
  • And so once more just to spell out what's already been repeated multiple times, the `=?charset(*lang)?[BQ]?...?=` encoding is defined in RFC 2047. – tripleee Nov 08 '21 at 08:29
  • Perhaps also mention that RFC 6532 explicitly states that _"messages in this [RFC 6532] format require the use of the `SMTPUTF8` extension [RFC6531] to be transferred via SMTP."_ In theory, your MUA might submit a message in this format and expect your MTA to reformat it for 7-bit transfer _if and only if_ your local ESMTP server advertises `SMTPUTF8` support. – tripleee Nov 08 '21 at 08:35
  • Thanks to you all! @tripleee Don't worry - if there is a correct answer I do accept it... – Lonzak Nov 08 '21 at 15:13