I have been doing some research and tests on the standardized email format. Ultimately I am looking to develop an email parser for an application. I am noticing some differences in the format of the email, mainly between email clients (Gmail, Mac Mail, etc.) and email marketing services (Constant Contact, Mailchimp, etc.).

My understanding of the format (RFC2822) is that a blank line (\n\n) separates the headers from the body. This appears to be consistent with emails received from email marketing services. Email clients, however, appear to have an extra set of headers or instructions for the message. See the example email strings below. Note that I pulled these strings via an email pipe, and that they are only snippets around the header/body split.

Email Marketing Service:

Content-Type: text/html;
    charset="utf-8"
Content-Transfer-Encoding: 8bit

 
<html>
<head>
    <title>Welcome to Banana Republic. Enjoy 25% off!   </title>
<STYLE type="text/css">
.ReadMsgBody
{ width: 100%;}
.ExternalClass
{width: 100%;}

Here you will see the line break separating the headers from the body. All good according to the format. Now look at the email client.

Email client:

Mime-Version: 1.0 (Mac OS X Mail 7.0 (1816))
X-Mailer: Apple Mail (2.1816)


--Apple-Mail=_28DD752B-7960-488D-994F-DA9408FCA880
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
    charset=windows-1252

Testing Mac Mail. This is the body.

You can see that in this case there is an additional set of "headers", which appear to be instructions about how Mac Mail has formatted the email.

I guess my question is: is this a valid format? Is there any specification on it? Are there any well known/documented ways to check for and parse this type of format without knowing which type of format is being received?

Chris
  • You need to look at multiple other RFCs, like RFC2045-2047 (MIME encodings) and how they describe multipart messages. I'm assuming your 2nd fragment is not including the Content-Type: multipart/mixed; boundary=Apple-Mail=_28DD752B-7960-488D-994F-DA9408FCA880 that I'd expect to see as part of that (where you can have multiple sub-sections, each conforming to RFC2822 rules). Proper and complete email parsing is HARD. What's allowed is spread out all over the place. – Joe Nov 18 '13 at 00:03
  • Note this link, which references a number of Email-related RFCs: http://www.lsoft.com/manuals/Maestro/2.1/Users/WebHelp/Appendix_D_Email_Related_RFCs.htm – Joe Nov 18 '13 at 00:04
  • @Joe - Content-Type: multipart/alternative actually. Not sure if that makes a difference, but I am going through the RFC references you provided to see if I can learn more. – Chris Nov 18 '13 at 05:44
  • As an author of several robust email parsers (GMime, MimeKit, Camel, etc) - PLEASE PLEASE PLEASE do not implement your own parser/generator unless you are committed to implementing everything correctly rather than yet-another-quick-and-dirty parser which only ends up making the job of people writing real parsers harder because so many people write parsers/generators that get it wrong so badly. – jstedfast Nov 19 '13 at 16:16

1 Answer

[extending points made in comments]

is this a valid format?

Yes. The overall framework for mail messages more complex than strict 7-bit ASCII text is known as MIME. It includes the specification of the "Content-Type" header in your first example, which informs a client that the whole message is HTML rather than plain text. Many (possibly most) messages these days are of type "multipart/alternative" at the outermost level, encapsulating 2 (or more!) versions of the message body, most often a text/plain version and a text/html version. The text/html version is itself often wrapped in another multipart container (typically multipart/related) that also carries embedded images.
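To make the nesting concrete, here is a minimal sketch using the parser in Python's standard library `email` package. The sample message, its addresses, and the boundary string `BOUNDARY` are made up for illustration and are not taken from the question's emails; `list_parts` is a hypothetical helper name.

```python
from email import message_from_string, policy

# Hypothetical sample message: a multipart/alternative container holding
# a text/plain and a text/html version of the same body.
RAW = (
    "From: sender@example.com\r\n"
    "To: rcpt@example.com\r\n"
    "Subject: Test\r\n"
    "MIME-Version: 1.0\r\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    "\r\n"
    "--BOUNDARY\r\n"
    "Content-Type: text/plain; charset=utf-8\r\n"
    "\r\n"
    "Plain body.\r\n"
    "--BOUNDARY\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<p>HTML body.</p>\r\n"
    "--BOUNDARY--\r\n"
)


def list_parts(raw):
    """Return (content type, is_multipart) for the message and each sub-part."""
    msg = message_from_string(raw, policy=policy.default)
    # walk() yields the top-level message first, then every nested part.
    return [(part.get_content_type(), part.is_multipart()) for part in msg.walk()]


for content_type, is_container in list_parts(RAW):
    print(content_type, "(container)" if is_container else "(leaf)")
```

Each part between the boundary lines has its own small header block followed by a blank line, which is the "extra set of headers" seen in the Mac Mail snippet.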

Is there any specification on it?

Yes. The basics of MIME are described in RFCs 2045-2049, and there have been many extensions and corrections described in later RFCs and type registration documents. MIME also provides the core components for the specification of HTTP documents, so many of the extensions are almost irrelevant for email.

Are there any well known/documented ways to check for and parse this type of format without knowing which type of format is being received?

Yes. While nearly all modern email is in MIME format, formally you can detect it by looking for the "MIME-Version" header. See RFC2045 for specifics. Note that your first example doesn't show that header but it must have existed in the full original because otherwise the headers you showed would be meaningless.
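As a rough sketch of that check, again assuming Python's standard library `email` package: the helper name `describe` and both sample messages are hypothetical, and the snippet only reports what the top-level headers declare.

```python
from email import message_from_string, policy

# Hypothetical samples, not taken from the question's emails.
PLAIN = (
    "From: a@example.com\r\n"
    "Subject: No MIME headers\r\n"
    "\r\n"
    "Just text.\r\n"
)
MULTIPART = (
    "From: a@example.com\r\n"
    "Mime-Version: 1.0\r\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    "\r\n"
    "--BOUNDARY\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
    "Hello.\r\n"
    "--BOUNDARY--\r\n"
)


def describe(raw):
    """Summarize what the top-level headers say about the message format."""
    msg = message_from_string(raw, policy=policy.default)
    return {
        # Header names are matched case-insensitively.
        "declares_mime": "MIME-Version" in msg,
        # Falls back to text/plain when no Content-Type header is present.
        "content_type": msg.get_content_type(),
        # None unless the Content-Type header carries a boundary parameter.
        "boundary": msg.get_boundary(),
    }


print(describe(PLAIN))
print(describe(MULTIPART))
```

A library parser applies the same defaults either way, so in practice the caller rarely needs to branch on the header itself.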

This demonstrates why you should probably reconsider writing your own mail parser. What you saw as two formats are in fact just different applications of the MIME framework. MIME is significantly older than RFC2822 (which, incidentally, is itself obsoleted by RFC5322) and has many mature and robust parsers available. It is easy to write a MIME parser that works for most mail, a little harder to write one that works for nearly all valid mail, and sanity-challenging to write one that safely handles real-world mail, which often isn't exactly correct and in some cases is designed to break naive parsers in malicious ways. Take advantage of the torn-out hair of decades of coders who have preceded you: use an existing parser.
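For illustration only, this is roughly what leaning on an existing parser looks like with Python's standard library `email` package (one option among many; GMime and MimeKit from the comments are others). The sample message and the helper name `preferred_body` are hypothetical. The sample uses quoted-printable with a windows-1252 charset, like the Mac Mail snippet, to show the decoding work the library does for you.

```python
from email import message_from_string, policy

# Hypothetical single-part message: "=E9" is quoted-printable for byte 0xE9,
# which is "é" in windows-1252.
SAMPLE = (
    "From: a@example.com\r\n"
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=windows-1252\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "\r\n"
    "Caf=E9 au lait.\r\n"
)


def preferred_body(raw, prefer=("plain", "html")):
    """Return (content type, decoded text) of the best body part, or None."""
    msg = message_from_string(raw, policy=policy.default)
    # get_body() searches nested multiparts for the first preferred text
    # subtype; for a single-part text message it returns the message itself.
    part = msg.get_body(preferencelist=prefer)
    if part is None:
        return None
    # get_content() undoes the transfer encoding and the charset.
    return part.get_content_type(), part.get_content()


print(preferred_body(SAMPLE))
```

The same call works unchanged on a multipart/alternative message, which is the point: the caller does not need to know in advance which layout was received.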

Bill Cole