We are developing a big data solution, and one requirement is to process incoming emails. The technology stack is not finalized yet, but we will most likely go with Sendmail as the MTA and Procmail as the MDA. We are open to any other, more efficient solution.
These emails essentially carry data in attachments and are not meant for an end user, so the email flow ends with Spark processing.
My first thought was that it would be great if a message queuing system such as Apache Kafka could accept emails as messages and then provide them to a client such as Spark on demand, but it seems that no message broker offers this approach.
This means we would have to receive emails via an SMTP MTA and then extract the information from the MDA.
We could use Procmail to extract the contents of each email and its attachments, put them in one folder per email, and then scan the folders and process them in Spark.
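To make the folder-per-email idea concrete, here is a minimal sketch of the extraction step using only Python's standard `email` package. The function name `save_attachments` and the random folder naming are my own assumptions, not anything Procmail or Spark prescribes.

```python
import uuid
from email import message_from_bytes, policy
from pathlib import Path


def save_attachments(raw_message: bytes, spool_dir: Path) -> Path:
    """Write each attachment of one email into a new folder under spool_dir."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    # One folder per email; a random name avoids trusting any message header.
    target = spool_dir / uuid.uuid4().hex
    target.mkdir(parents=True)
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the base name so a crafted filename cannot leave the folder.
        name = Path(part.get_filename() or "").name
        if name in ("", ".."):
            name = f"attachment-{index}"  # hypothetical fallback naming
        # decode=True undoes base64 / quoted-printable; nested messages give None.
        payload = part.get_payload(decode=True) or b""
        (target / name).write_bytes(payload)
    return target
```

Procmail can pipe a message to a program's standard input, so a small wrapper script could presumably call `save_attachments(sys.stdin.buffer.read(), some_spool_dir)` from a recipe, and Spark would then scan `some_spool_dir`. Note that attachments sharing a file name would overwrite each other in this sketch.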
Alternatively, if Spark has any plugin that could pull emails from an MDA and break each one down into its attachments, it would make life much simpler.
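In the absence of such a plugin, what I imagine is a plain function that Spark could map over raw messages. This is only a sketch under my own assumptions: `split_attachments` is a hypothetical name, and it assumes each input record holds exactly one complete RFC 822 message as bytes.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def split_attachments(raw_message: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, decoded payload) pairs for every attachment in one email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return [
        (
            part.get_filename() or f"attachment-{index}",  # hypothetical fallback name
            part.get_payload(decode=True) or b"",  # None for nested messages
        )
        for index, part in enumerate(msg.iter_attachments())
    ]
```

With PySpark, I would expect something along the lines of `sc.binaryFiles(mail_dir).flatMapValues(split_attachments)` to yield one record per attachment, where `mail_dir` is a directory with one message per file, but I have not verified that this scales.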
Any other, smarter solution would also be welcome.
So the fundamental question is: what technology (connectors etc.) is available for channeling emails into Spark for processing?