I've been reading RFC 2822 and watching some YouTube videos on context-free grammars, and I believe I've come to a reasonable conclusion for my use case. I'd like input on it.
First, I'm starting with a simple production for an email address:
addr-spec = local-part "@" domain
where domain is clearly defined in RFC 1035 and I'll be focusing on local-part here. Per RFC 1035, domain can easily be represented in a regular expression as
^(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
(use your choice of string start/end anchors).
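As a quick sanity check, here is a minimal Python sketch that compiles the domain pattern exactly as written above and tries it on a few inputs. The helper name is_valid_domain is my own and only for illustration.

```python
import re

# The domain pattern from the text, unchanged (lowercase only).
DOMAIN_RE = re.compile(
    r"^(?=.{0,255}$)"
    r"[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
)


def is_valid_domain(domain: str) -> bool:
    """Return True if the string matches the domain pattern above."""
    return DOMAIN_RE.match(domain) is not None


if __name__ == "__main__":
    for candidate in ["example.com", "mail.example.com", "-bad.com", "bad-.com"]:
        print(candidate, is_valid_domain(candidate))
```

One thing the sketch makes visible: as written, the pattern expects every label to be at least two characters long, so a single-letter label (as in a.com) does not match.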
local-part = dot-atom / quoted-string / obs-local-part
local-part gets interesting because all three of the alternatives that can make it up contain folding white space (FWS) references.
dot-atom = [CFWS] dot-atom-text [CFWS]
quoted-string = [CFWS] DQUOTE *([FWS] qcontent) [FWS] DQUOTE [CFWS]
obs-local-part = word *("." word)
Per section 2.2.3, the intention of including folding white space was to overcome the 998/78 character limits but not to allow the white space to become a part of the email address. Because my intention here is to construct a regular expression to process a string and validate its potential as a syntactically valid email address, FWS must be removed and I will therefore not be including it within the regular expression.
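Section 2.2.3 describes unfolding as removing any CRLF that is immediately followed by white space. Below is a minimal Python sketch of that one preprocessing step, assuming the address arrives as a raw, folded header value. The function name unfold is hypothetical, and the sketch does not try to strip the remaining white space or any comments.

```python
import re


def unfold(header_value: str) -> str:
    """Remove every CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", header_value)


if __name__ == "__main__":
    folded = "john.doe\r\n @example.com"
    # The CRLF is dropped; the white space that followed it is left in place.
    print(repr(unfold(folded)))
```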
In addition, per sections 3.2.4 and 3.2.5, CFWS and, in the case of quoted-string, the DQUOTEs are to be specifically excluded from semantic evaluation. Because of this, I will also be excluding them from the regular expression.
This simplifies things greatly and allows the construction of a strong regular expression for validating an email address. With these changes, I can now rewrite the three alternatives of local-part as follows:
dot-atom = dot-atom-text
quoted-string = qcontent
obs-local-part = word *("." word)
dot-atom is fairly quick and easy to break down:
dot-atom = dot-atom-text
dot-atom-text = 1*atext *("." 1*atext)
atext = ALPHA / DIGIT / ; Any character except controls,
        "!" / "#" /     ; SP, and specials.
        "$" / "%" /     ; Used for atoms
        "&" / "'" /
        "*" / "+" /
        "-" / "/" /
        "=" / "?" /
        "^" / "_" /
        "`" / "{" /
        "|" / "}" /
        "~"
So we get:
dot-atom = [a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*
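To see how the dot-atom expression behaves, here is a small Python sketch that anchors it and tries a few local parts. The names ATEXT, DOT_ATOM_RE and is_dot_atom are mine; the character class is copied from the expression above.

```python
import re

# One atext character, copied from the dot-atom expression above.
ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

# 1*atext *("." 1*atext): an atext first, then an optional dot before each later atext.
DOT_ATOM_RE = re.compile("^" + ATEXT + r"(?:\.?" + ATEXT + ")*$")


def is_dot_atom(local_part: str) -> bool:
    """Return True if the string is a dot-atom per the expression above."""
    return DOT_ATOM_RE.match(local_part) is not None


if __name__ == "__main__":
    for candidate in ["john.doe", "o'brien+tag", "john..doe", ".john", "john."]:
        print(candidate, is_dot_atom(candidate))
```

The optional dot inside the repeated group is what rules out leading, trailing and doubled dots: a dot is only ever consumed together with the atext character that follows it.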
The next two are rather curious: obs-local-part contains quoted-string through word, and because of how obs-local-part is written, we can exclude quoted-string altogether as being redundant.
quoted-string = qcontent
obs-local-part = word *("." word)
word = atom / quoted-string
During the breakdown of obs-local-part, we come across a new case where we need to ignore something for semantic evaluation: quoted-pair. Section 3.2.2 tells us that the "\" character is semantically "invisible", so it will be excluded from the regular expression. So, the following RFC definitions...
atom = [CFWS] 1*atext [CFWS]
quoted-pair = ("\" text) / obs-qp
obs-qp = "\" (%d0-127)
become
atom = 1*atext
quoted-pair = text / obs-qp
obs-qp = %d0-127
The breakdown of obs-local-part goes seven levels of recursion deep, but there is a shortcut that eliminates almost all the thinking here. Notice that obs-qp above contains all ASCII characters 0-127. quoted-pair can be obs-qp, qcontent can be quoted-pair, quoted-string is qcontent, and word can be quoted-string. Since the period or "full-stop" character is included in ASCII characters 0-127, we can simplify the definition of obs-local-part to be [\x00-\x7F]+.
Here are the definitions to support this statement:
obs-local-part = word *("." word)
word = atom / quoted-string
atom = 1*atext
quoted-string = qcontent
qcontent = qtext / quoted-pair
qtext = NO-WS-CTL / ; Non white space controls
        %d33 /      ; The rest of the US-ASCII
        %d35-91 /   ; characters not including "\"
        %d93-126    ; or the quote character
quoted-pair = text / obs-qp
NO-WS-CTL = %d1-8 /   ; US-ASCII control characters
            %d11 /    ; that do not include the
            %d12 /    ; carriage return, line feed,
            %d14-31 / ; and white space characters
            %d127
text = %d1-9 / ; Characters excluding CR and LF
       %d11 /
       %d12 /
       %d14-127 /
       obs-text
obs-qp = %d0-127
obs-text = *LF *CR *(obs-char *LF *CR)
obs-char = %d0-9 / %d11 /  ; %d0-127 except CR and
           %d12 / %d14-127 ; LF
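The shortcut above (obs-local-part collapsing to "one or more ASCII characters") can be sketched in Python as follows. Python's re module expects two hex digits after \x, so the class is spelled [\x00-\x7F] here. The function name is_obs_local_part is hypothetical.

```python
import re

# Any run of one or more characters in the ASCII range 0-127.
OBS_LOCAL_PART_RE = re.compile(r"[\x00-\x7F]+")


def is_obs_local_part(local_part: str) -> bool:
    """Return True if every character is ASCII and the string is non-empty."""
    return OBS_LOCAL_PART_RE.fullmatch(local_part) is not None


if __name__ == "__main__":
    for candidate in ["john.doe", "john doe", "john..doe", "a@b", "j\u00f6hn"]:
        print(repr(candidate), is_obs_local_part(candidate))
```

Because the class covers all of ASCII, it accepts spaces, consecutive dots and even "@" inside the local part; only non-ASCII input and the empty string are rejected.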
Coming full circle, let's revisit the original definitions:
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string / obs-local-part
and combine into one place the defined regular expressions:
domain = (?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*
dot-atom = [a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*
quoted-string = [\x00-\x7F]+
obs-local-part = [\x00-\x7F]+
One final piece of RFC 2822 is section 3.4.1, which states:
The locally interpreted string is either a quoted-string or a dot-atom. If the string can be represented as a dot-atom (that is, it contains no characters other than atext characters or "." surrounded by atext characters), then the dot-atom form SHOULD be used and the quoted-string form SHOULD NOT be used. Comments and folding white space SHOULD NOT be used around the "@" in the addr-spec.
Because of this (and the definitions of SHOULD and SHOULD NOT), we have two regular expressions we can use to validate an email address depending on how strict we want to be.
Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs
^[\x00-\x7F]+@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
Option 2: Prefer the use of dot-atom
^[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
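To compare the two options side by side, here is a minimal Python sketch that assembles both from the pieces defined earlier. The constant names and the check helper are mine, and the ASCII class is spelled \x00-\x7F because Python's re module expects two hex digits.

```python
import re

# Building blocks copied from the expressions in the text.
DOMAIN = (
    r"(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*"
)
ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

# Option 1: any ASCII local part. Option 2: dot-atom local part only.
OPTION_1 = re.compile(r"^[\x00-\x7F]+@" + DOMAIN + "$")
OPTION_2 = re.compile("^" + ATEXT + r"(?:\.?" + ATEXT + ")*@" + DOMAIN + "$")


def check(address: str) -> tuple:
    """Return (matches Option 1, matches Option 2) for the given address."""
    return (
        OPTION_1.match(address) is not None,
        OPTION_2.match(address) is not None,
    )


if __name__ == "__main__":
    for address in [
        "john.doe@example.com",
        "john doe@example.com",
        "john..doe@example.com",
        "john.doe@-example.com",
    ]:
        print(address, check(address))
```

Addresses with a space or a doubled dot in the local part pass Option 1 but not Option 2, which is the practical difference between the two.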
A note about RFC 5321...
The last thing I'm going to add here is that while RFC 2822 was written in April 2001 and provided no character limit for local-part, RFC 5321 came around in October 2008 and defines a limit of 64 octets in section 4.5.3.1.1. So, we would rewrite the options above as:
Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs
^(?=[^@]{0,64}@)[\x00-\x7F]+@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
Option 2: Prefer the use of dot-atom
^(?=[^@]{0,64}@)[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
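Here is a short Python sketch of the length-limited variants, mainly to show what the added lookahead does at the 64/65 character boundary. All names are my own, and \x00-\x7F is used because Python's re module expects two hex digits.

```python
import re

# Lookahead from the text: at most 64 non-"@" characters before an "@".
LOCAL_LIMIT = r"(?=[^@]{0,64}@)"
DOMAIN = (
    r"(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*"
)
ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

OPTION_1 = re.compile("^" + LOCAL_LIMIT + r"[\x00-\x7F]+@" + DOMAIN + "$")
OPTION_2 = re.compile(
    "^" + LOCAL_LIMIT + ATEXT + r"(?:\.?" + ATEXT + ")*@" + DOMAIN + "$"
)


def check(address: str) -> tuple:
    """Return (matches Option 1, matches Option 2) for the given address."""
    return (
        OPTION_1.match(address) is not None,
        OPTION_2.match(address) is not None,
    )


if __name__ == "__main__":
    print(check("a" * 64 + "@example.com"))  # local part at the limit
    print(check("a" * 65 + "@example.com"))  # local part one over the limit
```

One caveat worth knowing: since Option 1 allows "@" inside the local part, the lookahead only measures up to the first "@". An input such as "a@" followed by 100 more characters and "@example.com" still passes Option 1.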
Preference note...
I am one who believes that DNS and email addresses should always be represented in lowercase, so my regular expressions are written this way. For processing DNS and mailbox names, the case has always been ignored (as far as I'm aware). For validation, you can either run the regular expression in case-insensitive mode or convert your string input to lowercase before validation.
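The two approaches mentioned above (case-insensitive mode versus lowercasing first) look roughly like this in Python, using Option 2 as the example. The function names are hypothetical. I combine re.IGNORECASE with re.ASCII so that case-insensitive matching of [a-z] stays within ASCII letters.

```python
import re

ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"
PATTERN = (
    r"^(?=[^@]{0,64}@)" + ATEXT + r"(?:\.?" + ATEXT + r")*@"
    r"(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
)

# Approach 1: let the regex engine ignore case (ASCII letters only).
OPTION_2_IGNORE_CASE = re.compile(PATTERN, re.IGNORECASE | re.ASCII)
# Approach 2: keep the lowercase-only pattern and lowercase the input first.
OPTION_2_LOWER = re.compile(PATTERN)


def validate_ignore_case(address: str) -> bool:
    return OPTION_2_IGNORE_CASE.match(address) is not None


def validate_lowercased(address: str) -> bool:
    return OPTION_2_LOWER.match(address.lower()) is not None


if __name__ == "__main__":
    mixed = "John.Doe@Example.COM"
    print(validate_ignore_case(mixed), validate_lowercased(mixed))
```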