
I've noticed that when you forward an HTML email from Gmail (I'm not sure about other providers), the HTML structure changes in the process. The forwarded HTML loses all the ids declared in the original, and some other 'cleaning' is applied to the HTML as well.

Can anybody explain why this happens and whether it's possible to avoid it? Or does it depend entirely on the email provider?

I have an app that monitors emails in a specific inbox and tries to parse them, but as I said, when a user forwards an email to this inbox (from Gmail), the email's HTML structure gets cleaned and my code can no longer parse it because a lot of the ids are gone.

Due to this, I have to find a new way to parse what I require from the email, like using regular expressions on the plaintext section of the MIME message.
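
A minimal sketch of that fallback, using only Python's standard `email` and `re` modules. The `FLIGHT_RE` pattern and the helper names `plain_text_body` and `find_flights` are assumptions made for illustration (real itineraries vary a lot in wording), not something taken from an actual parser:

```python
import re
from email import message_from_string, policy

# Hypothetical pattern: the word "Flight", a two-character airline
# designator, then a 1-4 digit flight number. Adjust per airline format.
FLIGHT_RE = re.compile(r"Flight\s+([A-Z0-9]{2})\s?(\d{1,4})\b")


def plain_text_body(raw_message: str) -> str:
    """Return the text/plain part of a raw MIME message, or '' if absent."""
    msg = message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def find_flights(raw_message: str) -> list[str]:
    """Return flight codes such as 'LH400' found in the plaintext part."""
    body = plain_text_body(raw_message)
    return [airline + number for airline, number in FLIGHT_RE.findall(body)]
```

Because this only looks at the `text/plain` part, it does not depend on any ids or classes surviving the forward.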

I've searched for information on this and couldn't find anything.

prettyvoid
  • I think there are no standards that say a forwarded message has to look like the original, which means it could depend on the email client, locale, time of day, phase of the moon... Would it be possible to store/exchange the information you need some other way and just include an easy-to-recognize identifier in the e-mail, in the subject for example? – Wander Nauta Jul 14 '15 at 18:01
  • I'm basically developing an e-ticket parser that works through email. Many existing apps provide the same service: you forward your flight itinerary to trips@x.com, the relevant parts of the itinerary get parsed into JSON and posted to a callback URL, or, if the user is registered, the parsed data gets added to their records. At first I was analyzing the itineraries' HTML structure so I could parse it with an HTML parser and then query the relevant info, but unfortunately that won't work because, as I stated, the HTML changes on a forward. Regex seems to be my only solution now. – prettyvoid Jul 14 '15 at 18:23
  • For itineraries, you should only need the date/time and flight number, right? Those should be pretty feasible to figure out by regex I think. – Wander Nauta Jul 14 '15 at 18:28
  • It's not as straight-forward as you might think, but it's definitely doable. Regex seems to be the only option really. So ya, regex it is for now. I was just wondering why gmail changes the html, I guess like you said it's up to the email provider what they do with the content, they probably save a lot of traffic by cleaning unnecessary content. – prettyvoid Jul 14 '15 at 18:47
  • If it was straight-forward we'd all be out of work, right? :) On topic: some email clients, including recent Thunderbird, have the option to forward messages as attachments instead of inline. If you only parse the plaintext section, you might miss those. Good luck! – Wander Nauta Jul 14 '15 at 18:57
  • Right Wander. :) Thanks for the tip, I've already taken care of handling the various multiparts according to their type, so no worries about that. – prettyvoid Jul 14 '15 at 19:34
  • Hey @prettyvoid, did you find any generic solution? My problem is that I have already written parsers for forwarded mails. Now I am receiving direct mails and my parsers are breaking. Would appreciate some help. – rusty Apr 26 '16 at 06:56
  • @Rusty Unfortunately relying on html to parse the information ended up being too sloppy/risky for my requirements. I ended up learning regex and using it to parse the information I need. So first I convert any html content inside the email to plain text and then I apply my regex on the plain text. I know that might sound like a lot of work if you already built html parsers, but if you ever learn how to rely on html that is passed through pre-processors (like mentioned in the accepted answer), let me know. – prettyvoid Apr 26 '16 at 09:44
  • This is not regular and absolutely a failure from a business perspective where retention of the original formatting is essential. It's really a showstopper. Forwarding and replying are different processes. Replying doesn't necessarily retain the HTML but forwarding on nearly EVERY email client does. In fact, in web browser Gmail and Inbox forwarding retains HTML. Only in the mobile app it does not. – BruceW Jan 13 '17 at 18:43
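
The approach prettyvoid describes in the comments above (convert any HTML to plain text first, then apply regex) could be sketched roughly as follows with Python's standard `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the choice of which tags count as block-level are all assumptions for illustration; the thread does not show the actual implementation:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring ids, classes and all other attributes."""

    SKIP = {"script", "style", "title"}                    # content never shown
    BLOCK = {"p", "div", "br", "tr", "li", "table", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")   # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Since attributes are never consulted, the original HTML and a copy with its ids and classes removed should yield the same text, which is what makes the later regex step independent of the cleaning.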

2 Answers


Gmail strips the head tag, ids, and classes in its pre-processor. That means that when you forward or reply, as far as Gmail is concerned these items never existed, so they are not included in the forwarded message or reply.
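
To check whether a parser survives this kind of cleaning, one option is to approximate it locally before forwarding anything. The sketch below is only a rough simulation based on the description above (head, ids, and classes removed); it is not Gmail's actual pre-processor, and the name `simulate_stripping` is hypothetical:

```python
import re

# Remove the whole <head>...</head> element (the \b keeps <header> intact).
HEAD_RE = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)

# Remove id="..." and class="..." attributes (quoted or unquoted values).
ATTR_RE = re.compile(
    r"""\s+(?:id|class)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def simulate_stripping(html: str) -> str:
    """Approximate the cleaning described above; an assumption, not Gmail's code."""
    without_head = HEAD_RE.sub("", html)
    return ATTR_RE.sub("", without_head)
```

This regex-based version can also touch body text that happens to look like ` id=...`, so it is meant for test fixtures only. Running a parser against both the original and the stripped HTML shows which selectors depend on ids or classes.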

Gortonington
  • Didn't know about [preprocessing](https://litmus.com/help/email-clients/rendering-engines/). At least now I understand how email clients operate. Thanks. – prettyvoid Jul 14 '15 at 19:43
  • I wonder, though, why the email gets preprocessed only when it's forwarded. Why not when it's received? – prettyvoid Jul 14 '15 at 19:49
  • It does upon receipt, which is what is displayed. If you view source, you view the code prior to when it is run through the pre-processor, not after. – Gortonington Jul 14 '15 at 20:15
  • An interesting read if you are an email nerd like me: How email works, part 1 - http://www.clickz.com/clickz/column/2411041/how-email-works-part-one-the-story-of-send and then part 2: http://www.clickz.com/clickz/column/2415472/how-email-works-part-two – Gortonington Jul 14 '15 at 20:19

Since Gmail removes the head tag, ids, classes, and more, the best approach is to use inline CSS styles.

Tip: An inline style loses many of the advantages of a style sheet (by mixing content with presentation). Use this method sparingly.

Mega J