
The Problem

Any Gmail message can be saved as a single raw file. My assumption is that such a raw file contains everything needed to properly display the email along with all of its ingredients.

I was looking for a way to process such a file programmatically. There are two approaches to processing Gmail messages:

  1. Interfacing with the Gmail server via the Gmail API. Doing so requires authentication followed by HTTP / HTTPS interaction, as explained in the Gmail API documentation.

  2. Statically parsing the raw data and extracting from it all the elements that together make up an entire email message. These may include:

    • The email's attributes (sender's name, sender's email, date, subject, etc.)
    • The body (usually HTML, which may include embedded images and other files required for the HTML to be displayed properly).
    • Attachments.
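
To make the second approach concrete, here is a minimal sketch of the first static step: splitting a raw message into its header block and body, and collecting attributes such as From, Date and Subject. It does not use any MIME library; `RawMessage` and `SplitRawMessage` are hypothetical names introduced only for illustration, and the sketch ignores details such as encoded words in headers.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper type: header names (lower-cased) mapped to their raw values.
struct RawMessage {
    std::map<std::string, std::string> headers;
    std::string body;
};

static std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    return s;
}

// Splits a raw message at the first empty line: headers before it, body after it.
// Simplification: when a header name repeats (e.g. Received), the last value wins.
RawMessage SplitRawMessage(const std::string& raw) {
    RawMessage msg;
    std::string lastName;
    std::size_t pos = 0;
    while (pos < raw.size()) {
        const std::size_t eol = raw.find('\n', pos);
        std::string line = raw.substr(pos, eol == std::string::npos ? std::string::npos : eol - pos);
        pos = (eol == std::string::npos) ? raw.size() : eol + 1;
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) break;  // the empty line ends the header block

        if ((line[0] == ' ' || line[0] == '\t') && !lastName.empty()) {
            // Folded header: a line starting with whitespace continues the previous one.
            const std::size_t first = line.find_first_not_of(" \t");
            if (first != std::string::npos)
                msg.headers[lastName] += " " + line.substr(first);
            continue;
        }
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;  // not a header line, skip it
        lastName = ToLower(line.substr(0, colon));
        const std::size_t start = line.find_first_not_of(" \t", colon + 1);
        msg.headers[lastName] = (start == std::string::npos) ? "" : line.substr(start);
    }
    msg.body = raw.substr(pos);  // everything after the empty line
    return msg;
}
```

With this, `SplitRawMessage(raw).headers["subject"]` would give the subject line, and the `content-type` value tells whether the body is a single part or a multipart container that needs further splitting.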

My question:

How can I statically parse such a Gmail message's raw data without any interaction with the Gmail server / API, using only a MIME parser like this one, and then add on top of it whatever code is required to find and extract the Gmail-specific elements listed above?

What I wrote so far:

I have started parsing the raw data (stored in szMailBody) using this parser:

    // The function name is illustrative; szMailBody holds the saved raw-message page.
    void ParseRawMail(LPCSTR szMailId, LPCSTR szMailBody)
    {
        MIMELIB::CONTENT c;

        // Skip leading whitespace
        while ((*szMailBody == ' ') || (*szMailBody == '\r') || (*szMailBody == '\n'))
        {
            szMailBody++;
        }

        // The raw message text starts right after this tag
        const char deli[] = "<pre class=\"raw_message_text\" id=\"raw_message_text\">";
        szMailBody = strstr(szMailBody, deli);
        if (szMailBody == nullptr)
            return;  // marker not found, nothing to parse
        szMailBody += strlen(deli);

        if (c.Parse(szMailBody) != MIMELIB::MIMEERR::OK)
            return;

        // Get some headers
        auto senderHdr = c.hval("From");
        auto dateHdr = c.hval("Date");
        auto subjectHdr = c.hval("Subject");

        auto a1 = c.hval("Content-Type", "boundary");
        // Not a multi-part mail if empty
        // Then use c.Decode() to get and decode the single part body
        if (a1.empty())
            return;

        // The boundary (e.g. _NextPart_000_0046_01D38959.20888970) is the
        // delimiter between parts, not a header name, so pass a1 directly
        // instead of looking it up with hval().
        std::vector<MIMELIB::CONTENT> Contents;
        MIMELIB::ParseMultipleContent2(szMailBody, strlen(szMailBody), a1.c_str(), Contents);
    }
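
The wrapper-stripping step above could be made more defensive. The following is only a sketch under two assumptions that are not verified against Gmail's actual markup: that the raw text ends at the next `</pre>`, and that the page escapes characters such as `<` and `&` as HTML entities, so they must be decoded before the text is handed to a MIME parser. `UnescapeHtml` and `ExtractRawMessage` are hypothetical names.

```cpp
#include <string>

// Decodes a handful of HTML entities in a single pass.
// Assumption: these are the only entities present in the wrapped raw text.
std::string UnescapeHtml(const std::string& in) {
    static const struct { const char* entity; char ch; } kEntities[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}, {"&#39;", '\''}, {"&amp;", '&'}};
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& e : kEntities) {
                const std::size_t len = std::char_traits<char>::length(e.entity);
                if (in.compare(i, len, e.entity) == 0) {
                    out += e.ch;
                    i += len;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += in[i++];  // ordinary character, copy as-is
    }
    return out;
}

// Returns the text between the opening <pre ...> marker and the next </pre>,
// with entities decoded. Returns an empty string when the marker is missing.
std::string ExtractRawMessage(const std::string& page) {
    static const std::string kOpen =
        "<pre class=\"raw_message_text\" id=\"raw_message_text\">";
    static const std::string kClose = "</pre>";
    std::size_t begin = page.find(kOpen);
    if (begin == std::string::npos) return {};
    begin += kOpen.size();
    const std::size_t end = page.find(kClose, begin);
    const std::size_t count = (end == std::string::npos) ? std::string::npos : end - begin;
    return UnescapeHtml(page.substr(begin, count));
}
```

The returned string could then be passed to `c.Parse()` in place of the pointer arithmetic on `szMailBody`. If the raw file was saved as plain text rather than as an HTML page, this step is not needed at all.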

My question is different from this one, because Gmail raw data is complex enough to require further steps even when the user is familiar with MIME parsing. There is additional complexity in extracting attachments into separate files, for example, or in restoring the email's body as an HTML file along with its dependencies (such as embedded images). Processing Gmail raw data requires a layer of logic on top of MIME parsing.
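
To illustrate the kind of layer meant here, below is a minimal sketch, independent of any MIME library, of two building blocks: splitting a multipart body on its boundary and decoding a base64 part (the usual transfer encoding for attachments and embedded images). `SplitMultipart` and `DecodeBase64` are hypothetical names; the parser's own `c.Decode()`, mentioned in the code comment above, may already cover the decoding step.

```cpp
#include <string>
#include <vector>

// Splits a multipart body into its parts (each part is headers + empty line + content).
// Simplification: the delimiter is searched anywhere, not only at the start of a line.
std::vector<std::string> SplitMultipart(const std::string& body, const std::string& boundary) {
    std::vector<std::string> parts;
    const std::string delim = "--" + boundary;
    std::size_t pos = body.find(delim);
    while (pos != std::string::npos) {
        std::size_t start = pos + delim.size();
        if (body.compare(start, 2, "--") == 0) break;  // closing delimiter "--boundary--"
        start = body.find('\n', start);                // skip the rest of the delimiter line
        if (start == std::string::npos) break;
        ++start;
        const std::size_t next = body.find(delim, start);
        std::size_t end = (next == std::string::npos) ? body.size() : next;
        // The line break just before a delimiter belongs to the delimiter, not the part.
        if (end > start && body[end - 1] == '\n') --end;
        if (end > start && body[end - 1] == '\r') --end;
        parts.push_back(body.substr(start, end - start));
        pos = next;
    }
    return parts;
}

// Decodes base64 text, skipping line breaks and stopping at '=' padding.
std::string DecodeBase64(const std::string& in) {
    std::string out;
    unsigned int buffer = 0;
    int bits = 0;
    for (unsigned char ch : in) {
        unsigned int v;
        if (ch >= 'A' && ch <= 'Z') v = ch - 'A';
        else if (ch >= 'a' && ch <= 'z') v = ch - 'a' + 26;
        else if (ch >= '0' && ch <= '9') v = ch - '0' + 52;
        else if (ch == '+') v = 62;
        else if (ch == '/') v = 63;
        else if (ch == '=') break;   // padding marks the end of the data
        else continue;               // line breaks and other non-alphabet characters
        buffer = (buffer << 6) | v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return out;
}
```

An attachment would then be a part whose Content-Disposition header says `attachment`, written to disk after decoding. An embedded image would be a part carrying a Content-ID header that the HTML body references through a `cid:` URL, so restoring the HTML file means saving that part and rewriting the reference to point at the saved file.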

Michael Haephrati
  • Best reword this so it looks less like a request for a library or tutorial. – user4581301 Jan 12 '18 at 19:36
  • Better? :) I made it clear that I am asking a question (which I am), not requesting a library or a tutorial... – Michael Haephrati Jan 12 '18 at 19:39
  • Possible duplicate of [Simple C++ MIME parser](https://stackoverflow.com/questions/218089/simple-c-mime-parser) – rustyx Jan 12 '18 at 20:19
  • No, because Gmail raw data is a special case of MIME and requires special handling, while there aren't any solutions out there, especially not Win32 / c++ ones. My question is specifically about parsing Gmail raw data. – Michael Haephrati Jan 12 '18 at 20:20
  • @MichaelHaephrati: "*Interfacing with the Gmail server via Gmail API. Doing so require authentication followed by a HTTP / HTTPS interaction as explained in the Gmail API documentation.*" - actually, it doesn't, if you use SMTP/IMAP to access the emails. – Remy Lebeau Jan 12 '18 at 23:33
  • @MichaelHaephrati: "*Gmail raw data is a special case of MIME*" - what makes you think that? Looks like a standard MIME-encoded email to me. "*and require special handling*" - like what exactly? What is your line of thinking on this? Are you referring to the extra stuff that *Gmail's website* adds around the email when you invoke the "Show original" option? The stuff that is not part of the email itself. The stuff you won't see if you stop interfacing with Gmail over HTTP to begin with. You should be using POP3/SMTP/IMAP instead – Remy Lebeau Jan 12 '18 at 23:35
  • @RemyLebeau please allow me to explain: I know POP3/SMTP/IMAP can be used to fetch emails from any server, including Gmail, but my question is entirely different. It is about extracting an email, statically, from its raw data. Doing so with Gmail raw data requires more processing, which is the reason for my question. Using a MIME parser is one of the steps but does not cover the entire process. The idea of interpreting an email with no internet connection is interesting for me and I hope, for others as well. – Michael Haephrati Jan 13 '18 at 11:55
  • @RemyLebeau - No, Gmail raw data encapsulates everything needed to get the contents of an email with no interaction and no authentication. For the sake of my argument, suppose that you connected to your Gmail account and saved one of your emails in "Raw data" format. A few days later you are offline and can't connect to the Gmail server. You would still be able to re-compose this email from its raw data. That's what my question is about. – Michael Haephrati Jan 13 '18 at 13:05
  • @MichaelHaephrati I did look at the raw data before posting my last comments. I saw nothing out of the ordinary that a standard MIME parser shouldn't be able to handle. So, what exactly do you think is "extra" that is prohibiting you from accomplishing that? Please edit your question to show an actual example you are having trouble with. – Remy Lebeau Jan 13 '18 at 16:09
  • Gmail raw data uses MIME to its full extent. See : http://www.ehfeng.com/gmail-api-mime-types/ however I think there is a place for a question about how to specifically parse Gmail messages because there are 2 issues to address: 1. Is it even possible to get an entire email just from its raw data, with no need for any interaction with the Gmail server ( I think - yes). 2. How would such interpretation of the raw data be done using MIME. That requires knowledge in MIME but not just... It combines understanding the Gmail raw data format and the MIME format. – Michael Haephrati Jan 13 '18 at 18:36
  • @MichaelHaephrati it is clear that you don't understand MIME or how it works. Gmail's raw format is just plain ordinary standard MIME, there is nothing special about it. An email is self-contained, it has everything needed to recreate the content the user sees. – Remy Lebeau Jan 13 '18 at 19:05
  • @RemyLebeau I respect your opinion. In my opinion, this is like claiming such and such c++ function is just plain and ordinary as it uses c++ ... My question combines 3 pillars: MIME, Gmail and c++. There are questions about parsing Gmail raw data in Python, for example (https://stackoverflow.com/questions/46431189/python-parse-gmail-messages-get-with-format-raw). Mine is about parsing Gmail raw data in c++. – Michael Haephrati Jan 13 '18 at 19:41

0 Answers