
I am building a big data analytics solution on top of my web application data. The logical ETL architecture I have is:
1. Extract - Data is first ingested from MongoDB
2. Transform - Multiple transformations are applied to the data, e.g. data type conversions, data formatting, and joining flattened BSON documents (see the sketch after this list)
3. Load - The transformed data is finally pushed to Elasticsearch
4. I can then run machine learning and build statistical models on the transformed data in ES to generate insights
5. My UI will access these generated insights.
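
For illustration, the kind of transformation I mean in step 2 (flattening nested BSON, converting types, joining) might look like this hypothetical PySpark sketch; the field names and sample documents are made up:

```python
from pyspark import SparkContext

sc = SparkContext(appName="transform-sketch")

# Toy stand-ins for documents extracted from MongoDB (field names are made up).
raw_orders = sc.parallelize([
    {"_id": "o1", "user": {"id": "u1"}, "order": {"amount": "12.50"}},
])
raw_users = sc.parallelize([
    {"_id": "u1", "name": "Alice", "address": {"country": "IN"}},
])

def flatten_order(doc):
    # Pull the nested sub-documents up and convert the amount to a number.
    return (doc["user"]["id"],
            {"order_id": doc["_id"], "amount": float(doc["order"]["amount"])})

orders = raw_orders.map(flatten_order)
users = raw_users.map(lambda d: (d["_id"],
                                 {"name": d["name"],
                                  "country": d["address"]["country"]}))

# Join the two flattened collections on user id.
print(orders.join(users).collect())
```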

The main issue is at step 2 above, the transformation step. I have gone through MongoDB rivers and Logstash for ETL, but what if I want to do heavy transformations that are only possible in Spark?
What would be the optimal solution available on the market right now for this?

From a data size perspective, accumulation is in the GBs per day, with millions of documents in MongoDB.
To limit the scope of development, I have chosen ES as my analytics back-end and MongoDB as my primary database.

Puneet Jindal

1 Answer


You could use the mongo-hadoop connector for Apache Spark to extract data from MongoDB, run the transformations in Spark, and even do the machine learning with Spark's MLlib.
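
For instance, a minimal PySpark sketch of the extraction might look like the following. The connection URI and the database/collection names are placeholders, and it assumes the mongo-hadoop Spark artifact is on the classpath:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-extract")

# mongo.input.uri points the connector at the source collection
# (placeholder host/db/collection).
read_conf = {"mongo.input.uri": "mongodb://localhost:27017/mydb.events"}

# Each record comes back as a (document _id, document) pair.
docs = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.MongoInputFormat",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.MapWritable",
    conf=read_conf,
)

print(docs.take(1))
```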

The resulting data could be stored back into MongoDB, reducing the number of components in your ETL stack. Although if you want to, you could store the output from Apache Spark in some other system instead.
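
As a sketch, writing results back through the same connector could look like this. The output URI is a placeholder, and the `path` argument is required by the Hadoop file API but ignored by `MongoOutputFormat`:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-load")

# Toy transformed output: (id, document) pairs.
results = sc.parallelize([("u1", {"name": "Alice", "total": 12.5})])

# mongo.output.uri names the target collection (placeholder).
write_conf = {"mongo.output.uri": "mongodb://localhost:27017/mydb.results"}

results.saveAsNewAPIHadoopFile(
    path="file:///this-path-is-unused",
    outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=write_conf,
)
```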

Wan B.
  • How would I use the mongo-hadoop connector for Apache Spark if I want my pipeline to process all Mongo collections, i.e. I would not provide a collection name, to simulate something like a batch extract and just query by timestamp? So my query would bring all collection documents updated from a certain start time to an end time. – Puneet Jindal Mar 10 '16 at 10:39
  • How to use `mongo-hadoop` would be a separate question, and you should post a new question instead. As for what you want to query out of MongoDB, whether it is all documents in a collection or only those from certain time ranges depends on your use case. Just run MongoDB queries as needed with the right filters. – Wan B. Mar 11 '16 at 12:33
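
For instance, such a time-range filter can be pushed down to MongoDB through the connector's `mongo.input.query` option. A sketch, where the `updated_at` field name and the connection URI are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-incremental-extract")

# Push the time-range filter down to MongoDB via mongo.input.query,
# so only documents updated in the window are extracted.
# The "updated_at" field name and the URI are placeholders.
read_conf = {
    "mongo.input.uri": "mongodb://localhost:27017/mydb.events",
    "mongo.input.query": (
        '{"updated_at": {"$gte": {"$date": "2016-03-01T00:00:00.000Z"},'
        ' "$lt": {"$date": "2016-03-02T00:00:00.000Z"}}}'
    ),
}

docs = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.MongoInputFormat",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.MapWritable",
    conf=read_conf,
)
```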