I am building a big data analytics solution on top of my web application data.
The logical ETL architecture I have is:
1. Extract - Data is first extracted from MongoDB.
2. Transform - Multiple transformations are applied, e.g. data conversions, data formatting, and flattening/joining nested BSON documents (a sketch of this pipeline follows the list).
3. Load - The transformed data is finally pushed to Elasticsearch.
4. I can then run machine learning and build statistical models on the transformed data in ES to generate insights.
5. My UI will access these generated insights.
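For concreteness, here is a minimal PySpark sketch of steps 1-3. It assumes the MongoDB Spark Connector and the elasticsearch-hadoop library are on the Spark classpath; the URIs, database/collection, and index names are all hypothetical:

```python
from pyspark.sql import SparkSession

# Minimal ETL skeleton: MongoDB -> Spark -> Elasticsearch.
# All connection details below are placeholders.
spark = (SparkSession.builder
         .appName("mongo-to-es-etl")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/webapp.events")  # hypothetical source
         .config("es.nodes", "localhost")
         .config("es.port", "9200")
         .getOrCreate())

# 1. Extract: read the MongoDB collection as a DataFrame
#    ("mongo" is the short name for connector 2.x/3.x; newer
#    versions of the connector use "mongodb")
raw = spark.read.format("mongo").load()

# 2. Transform: placeholder for the heavy transformations
transformed = raw  # e.g. type conversions, formatting, flattening, joins

# 3. Load: index the transformed rows into Elasticsearch
(transformed.write
    .format("org.elasticsearch.spark.sql")
    .option("es.resource", "analytics/events")  # hypothetical index/type
    .mode("append")
    .save())
```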
The main issue is with step 2 above, the transformation step. I went through MongoDB rivers (now deprecated) and Logstash for ETL, but what if I want to do heavy transformations that are only practical in Spark?
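As an illustration of the kind of heavy transformation I mean, continuing from the sketch above (the `items` array, the field names, and the `users` collection are all assumptions for illustration):

```python
from pyspark.sql.functions import col, explode, to_date

# Flatten an embedded array of sub-documents, apply type/date
# conversions, and join with a second collection -- the sort of
# multi-step transform that is awkward in Logstash alone.
orders = (raw
          .withColumn("item", explode(col("items")))
          .select("userId",
                  to_date(col("createdAt")).alias("created_on"),
                  col("item.sku").alias("sku"),
                  col("item.price").cast("double").alias("price")))

users = (spark.read.format("mongo")
         .option("collection", "users")  # assumed second collection
         .load()
         .select("userId", "country", "plan"))

# Join the flattened order lines with user attributes before indexing
transformed = orders.join(users, on="userId", how="left")
```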
What is the optimal solution currently available on the market for this?
From a data-size perspective, accumulation runs to GBs per day, with millions of documents in MongoDB.
To limit the scope of development, I have chosen ES as my analytics back-end and MongoDB as my primary database.