We are developing a big data solution in which one requirement is to process incoming emails. The technology stack is not finalized yet, but we will most likely go with Sendmail as the MTA and Procmail as the MDA. We are open to any other, more efficient solution.

These emails essentially carry data in attachments and are not meant for end users, so the email flow ends with Spark processing.

My first thought was that it would be great to have a message queuing system such as Apache Kafka that could accept emails as messages and then provide them to a client such as Spark on demand, but that approach does not seem to be available in any of the message brokering systems.

This means we would have to receive emails via an SMTP MTA and then extract the information from the MDA.

We could use Procmail to extract the contents of each email and its attachments, put them in one folder per email, and then scan the folders and process them in Spark.
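To make the folder-per-email idea concrete, here is a minimal sketch in Python using only the standard library `email` package. The function name `unpack_attachments` and the `spool_dir` layout are assumptions made for illustration, not part of any existing tool; the script is the kind of thing a Procmail pipe recipe could hand each message to.

```python
import email
import os
import tempfile
from email import policy


def unpack_attachments(raw_bytes, spool_dir):
    """Write each attachment of one raw email into its own per-message folder.

    Hypothetical helper: returns the list of file paths it created.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # One folder per email; mkdtemp guarantees a unique directory name.
    msg_dir = tempfile.mkdtemp(prefix="mail-", dir=spool_dir)
    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        # Do not trust the sender's filename as a path: keep the base name
        # only, and prefix the index so duplicate names cannot collide.
        base = os.path.basename(part.get_filename() or "") or "unnamed"
        path = os.path.join(msg_dir, f"{index}-{base}")
        with open(path, "wb") as handle:
            # decode=True undoes base64 / quoted-printable transfer encoding.
            handle.write(part.get_payload(decode=True) or b"")
        saved.append(path)
    return saved
```

If the MDA pipes the message to the script on standard input, the call would look like `unpack_attachments(sys.stdin.buffer.read(), "/some/spool/dir")`, where the spool path is whatever directory the Spark job is configured to scan.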

Alternatively, if Spark has any plugins that could pull emails from an MDA and break them down into their attachments, it would make life much simpler.

If there is any other smarter solution it would be welcome.

So the fundamental question is: what technology (connectors etc.) is available for channeling emails into Spark for processing?

  • "Sendmail", in this day and age; seriously? Most shops who don't have a legacy Sendmail to feed will prefer another MTA; Postfix is very popular as a replacement. – tripleee Sep 30 '15 at 03:09
  • What is your question? We are not going to implement this system for you. – tripleee Sep 30 '15 at 03:10
  • @tripleee, I've updated the question but essentially I need to know if there are any connectors to suck in email into Spark for distributed large scale processing. If yes what are they and if not then to discuss the best course of action. – Mohammed Lokhandwala Sep 30 '15 at 04:42
  • Email itself *is* a "message queueing system". You can easily pick apart a MIME message with e.g. `munpack`; see also http://superuser.com/questions/406125/utility-for-extracting-mime-attachments – tripleee Sep 30 '15 at 04:51
  • @tripleee, I already saw those but they are like second / third best option as it involves several processing steps. Ideally it should be like a server that persists to some store readable by Spark or passes it to Spark streaming. – Mohammed Lokhandwala Sep 30 '15 at 08:39

1 Answer

Incoming email processing with Mailgun or Sendgrid is so easy that I can hardly imagine any alternative for a new system, especially a big one. I have only played with them, but my impression was that any actual or potential (billions of emails) email-related problem of mine is solved for good. They are not related to Spark: those systems just send the email content as an HTTP POST request to a URL you provide.
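As a rough sketch of the receiving side, assuming nothing about the provider beyond "it sends an HTTP POST to your URL": the handler below stores each request body, unparsed, in a spool directory that a Spark job could later read. The names `store_post`, `InboundMailHandler`, `serve` and the `incoming` directory are all hypothetical, and the form field layout of any particular provider is deliberately not assumed here.

```python
import os
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

SPOOL_DIR = "incoming"  # hypothetical directory for downstream processing


def store_post(spool_dir, content_type, body):
    """Persist one POST body, prefixed with its Content-Type header line."""
    os.makedirs(spool_dir, exist_ok=True)
    name = uuid.uuid4().hex + ".post"
    final = os.path.join(spool_dir, name)
    # Write under a dot-prefixed temporary name, then rename, so a scanner
    # that skips hidden files never picks up a half-written file.
    tmp = os.path.join(spool_dir, "." + name)
    with open(tmp, "wb") as handle:
        # Keep the content type: it carries the multipart boundary, if any.
        handle.write(b"Content-Type: " + content_type.encode("latin-1"))
        handle.write(b"\r\n\r\n")
        handle.write(body)
    os.replace(tmp, final)
    return final


class InboundMailHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        store_post(SPOOL_DIR, self.headers.get("Content-Type", ""), body)
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), InboundMailHandler).serve_forever()
```

Storing the raw body and decoding it later keeps the receiver trivial and moves all parsing into the batch job, at the cost of one extra decoding step downstream.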

Sendgrid used to parse encodings incorrectly; their support ignored my emails and eventually deleted a ticket without solving the problem. Mailgun always returns UTF-8 regardless of the original encoding. Manual MIME parsing is such a grandiose task in itself that it is better to use existing solutions, unless the emails are generated by a computer. But even then, IaaS services are so much cheaper than developer time.

V.B.