I have been researching and testing the standard email format because I ultimately want to develop an email parser for an application. I am noticing some differences in the format, mainly between email clients (Gmail, Mac Mail, etc.) and email marketing services (Constant Contact, Mailchimp, etc.).
My understanding of the format (RFC 2822) is that a blank line (\n\n) separates the headers from the body. This appears to be consistent with emails received from email marketing services. Email clients, however, appear to have an extra set of headers or instructions for the message. See the example email strings below. Note that I pulled these strings via an email pipe, and that they are only snippets around the header/body split.
Email Marketing Service:
Content-Type: text/html;
    charset="utf-8"
Content-Transfer-Encoding: 8bit

<html>
<head>
<title>Welcome to Banana Republic. Enjoy 25% off! </title>
<STYLE type="text/css">
.ReadMsgBody
{ width: 100%;}
.ExternalClass
{width: 100%;}
Here you can see the blank line separating the headers from the body, which matches the format. Now look at the email client.
Email client:
Mime-Version: 1.0 (Mac OS X Mail 7.0 (1816))
X-Mailer: Apple Mail (2.1816)

--Apple-Mail=_28DD752B-7960-488D-994F-DA9408FCA880
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
    charset=windows-1252

Testing Mac Mail. This is the body.
In this case there is an additional set of "headers" after the line starting with --Apple-Mail, which appear to be instructions about how Mac Mail has formatted the email.
My question is: is this a valid format? Is there a specification for it? Are there any well-known, documented ways to detect and parse this layout without knowing in advance which of the two is being received?
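To make the question concrete, here is a minimal sketch of the kind of format-agnostic handling I am after, assuming Python's standard `email` package. The function name `extract_text_parts` and both sample messages are hypothetical. The multipart sample also adds a top-level `Content-Type: multipart/alternative` header with a `boundary` parameter, which my snippet above does not show; I am assuming the full message carries one, since a parser needs it to find the parts.

```python
from email import message_from_string
from typing import List, Tuple


def extract_text_parts(raw: str) -> List[Tuple[str, str]]:
    """Return (content_type, decoded_text) for every text/* leaf part."""
    msg = message_from_string(raw)
    found = []
    # walk() yields the message itself and, for multipart messages,
    # every nested part, so both layouts go through the same loop.
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts only hold other parts
        if part.get_content_maintype() != "text":
            continue  # skip attachments, images, etc.
        # decode=True undoes quoted-printable / base64 and returns bytes
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        found.append((part.get_content_type(), text))
    return found


# Hypothetical single-part message (marketing-service style).
SINGLE_PART = (
    "Content-Type: text/html;\n"
    ' charset="utf-8"\n'
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "<html><body>Hi</body></html>\n"
)

# Hypothetical multipart message (mail-client style). The boundary
# header is an assumption; it is not visible in the snippet above.
MULTIPART = (
    "Mime-Version: 1.0\n"
    "Content-Type: multipart/alternative;\n"
    ' boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "Content-Type: text/plain;\n"
    " charset=windows-1252\n"
    "\n"
    "Testing Mac Mail. This is the body.\n"
    "--BOUNDARY--\n"
)

if __name__ == "__main__":
    for sample in (SINGLE_PART, MULTIPART):
        for content_type, text in extract_text_parts(sample):
            print(content_type, repr(text.strip()))
```

The idea is that the caller never checks which kind of sender produced the message: the top-level headers are always split from the body at the first blank line, and any part that follows a boundary line gets the same header/blank-line/body treatment.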