I am working on a project where I need to identify emails sent by real humans, as opposed to bulk mail, notifications and newsletters. Is there any definite way of doing that? Is there any information in the email headers that can help? I am working on top of Gmail IMAP, so I already have only non-spam emails.

Any help in this regard is appreciated. Thanks!

Kuldeep Kapade

1 Answer

There isn't a clear way to distinguish bulk mail from personalised mailings. Unlike with spam, most bulk mail is requested/expected, so the sender doesn't do odd things to get round spam filters, which means these emails often blend in fairly well.

However, there are some trends that you can look for. If you want to do it reliably, you will probably need to apply some kind of scoring system, as spam filters do.
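As a rough illustration of the scoring idea, here is a minimal Python sketch. The signal names, weights and threshold are all made up for this example and would need tuning against your own mail; treat it as a starting point rather than a definitive implementation.

```python
# Hypothetical weights: the signal names and values are illustrative only
# and would need tuning against real mail.
WEIGHTS = {
    "has_list_unsubscribe": 3.0,
    "noreply_sender": 2.0,
    "unsubscribe_text": 1.5,
    "generic_greeting": 1.0,
}
THRESHOLD = 3.0  # assumed cut-off; raise it to reduce false positives


def bulk_score(signals):
    """Sum the weights of every known signal that fired for a message."""
    return sum(
        WEIGHTS[name]
        for name, fired in signals.items()
        if fired and name in WEIGHTS
    )


def looks_like_bulk(signals):
    """Classify a message as bulk when its score reaches the threshold."""
    return bulk_score(signals) >= THRESHOLD
```

For example, `looks_like_bulk({"has_list_unsubscribe": True, "unsubscribe_text": True})` is true with these weights, while a message that only has a generic greeting stays below the threshold. No single weak signal decides the outcome on its own, which is the point of scoring.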

You will also need to accept that you are bound to get a substantial proportion of false positives and false negatives.

Some things that are common to bulk mail that appear less often in personalised correspondence:

  1. The "To" and "Cc" addresses do not contain a local recipient. Sometimes the sender will send to "mailList@mydomain.com" instead of "recipientA@recipientAdomain.com", "recipientB@recipientBdomain.com", etc. In these cases, it is also likely that only one address appears in "To" and nothing appears in "Cc".
  2. The "From" address is "noreply@", "newsletter@", "do-not-reply@" or "mailinglist@", or, less commonly, a term like "support@" or "sales@" (but remember, these could cause false positives).
  3. The presence of a "List-Unsubscribe:" header
  4. The message contains an unsubscribe link. Run pattern matching to find common phrases in the final few lines of the email. Look for links, or words such as "unsubscribe", "opt out", etc.
  5. Mailing lists tend to have rich content. Check for heavy use of CSS and lots of images, or for the entire message being contained within a <table></table> or <ul><li></li></ul> structure, i.e. the kind of markup that something like Dreamweaver would produce, rather than a mail client.
  6. Headers or bold content at the top of the message. If the first bit of a message resembles a newsletter, it's probably a newsletter.
  7. Lots of links or frequent linking to the same (or same few) websites. Newsletters will try to guide the user to the company's site(s), as much as they can. You may score this even more highly if the linked domain matches (or resembles) the sender domain.
  8. Heavy references to social media. If it's a newsletter containing several articles, each story may have its own "Tweet this" or "Like this" link. Personal emails are likely to contain (at most) one reference to Twitter, Facebook, etc. (in the signature).
  9. Notifications and other auto-generated messages will often follow the same basic format. If you have the capabilities, run some kind of diffing or other comparison against previous messages. A strong match would imply automation.
  10. There is no greeting, or a generic greeting. However, personal emails will often skip the "Dear Fred" bit too, so this isn't a good enough detection by itself; but things like "Dear User" or "Dear Customer" are almost certainly generic.
  11. The message is unlikely to end with a personal sign-off such as "Regards, Ian" or "Yours Sincerely, John Doe".
  12. The sender has scored highly before. Keep a record. If a sender triggers a high score several times, they are almost certainly bulk mailing.
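A few of the checks above (items 1 to 4 and 10) can be sketched with Python's standard `email` module. This is only an illustration under stated assumptions: the function name `extract_signals`, the list of sender local parts and the regular expressions are my own choices, and the body handling is deliberately simplistic (it only looks at the first text part).

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Assumed list of local parts that suggest an automated sender (illustrative).
BULK_LOCAL_PARTS = {"noreply", "no-reply", "do-not-reply", "donotreply",
                    "newsletter", "mailinglist"}
UNSUBSCRIBE_RE = re.compile(r"unsubscribe|opt[\s-]?out", re.IGNORECASE)
GENERIC_GREETING_RE = re.compile(
    r"^\s*dear\s+(user|customer|member|subscriber)\b", re.IGNORECASE)


def extract_signals(raw_message, my_address):
    """Return a dict of boolean bulk-mail hints for one raw RFC 822 message."""
    msg = message_from_string(raw_message)

    # Take the first text part as the body; good enough for a sketch.
    body = ""
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
            break

    local_part = parseaddr(msg.get("From", ""))[1].partition("@")[0].lower()
    recipients = (msg.get("To", "") + "," + msg.get("Cc", "")).lower()
    # Unsubscribe text usually sits in the final few lines of the message.
    tail = "\n".join(body.splitlines()[-5:])

    return {
        "has_list_unsubscribe": msg.get("List-Unsubscribe") is not None,
        "noreply_sender": local_part in BULK_LOCAL_PARTS,
        "not_addressed_to_me": my_address.lower() not in recipients,
        "unsubscribe_text": bool(UNSUBSCRIBE_RE.search(tail)),
        "generic_greeting": bool(GENERIC_GREETING_RE.search(body)),
    }
```

Each value is only a hint. As noted above, any one of them can fire on a personal message, so they are best combined into a score rather than used individually.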
SimonMayer
  • Thanks! This helps; it's more along the lines of what I was looking for. I was also thinking about making a whitelist of clients by tracking 'Request' headers, by creating a corpus of known emails and then matching other emails against it. Do you think there is any flaw with this model? – Kuldeep Kapade Feb 05 '12 at 09:49
  • I'm not sure what you mean by 'Request' headers. The main problem you will have with any method is the time involved in getting the right balance, so you don't get too many false results. Whitelists are fine, as long as you don't make them so lenient that they undo all your other work. – SimonMayer Feb 05 '12 at 10:22
  • Each email comes with header info called 'Request', which is information on where it has travelled from, such as which client it was sent from, which servers it went through and so on. This is the most reliable information in email headers. I am just trying to figure out how to make sense of that data. – Kuldeep Kapade Feb 07 '12 at 09:34
  • @KuldeepKapade I think you mean the "Received" header. It's probably not a good thing to look for. Only the first "Received" header would be relevant, as all others are just relaying/receiving the message. In any case, it's only worth using if you know the server exclusively sends bulk mail. Many companies will use the same mail server for bulk and personalised mail. – SimonMayer Feb 07 '12 at 13:50
  • Sorry, I meant 'Received'. I didn't even realize that I was mistyping it. You are right, it's probably not the best sole indicator. I'll have to rely on more than one indicator to determine this. – Kuldeep Kapade Feb 18 '12 at 09:29
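For anyone who still wants to inspect the "Received" headers discussed in the comments above, here is a small Python sketch. Each relaying server prepends its own "Received" line, so the hop added closest to the sender is the last one in the header block. The function name `originating_hop` is just an illustrative choice, and the value should be treated as a weak hint at best, since many companies use the same server for bulk and personalised mail.

```python
from email import message_from_string


def originating_hop(raw_message):
    """Return the earliest 'Received' header value, or None if there is none.

    Servers prepend their 'Received' line, so the earliest hop is the
    last one in the header block.
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    # Collapse folded (multi-line) header values into a single line.
    return " ".join(received[-1].split())
```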