I want to move my MongoDB production data to HDFS through a continuous ETL pipeline so that I can run Spark/MR jobs on it.
I know that the MongoDB Connector for Hadoop exists to read/write data from/to MongoDB, but I don't want to incur that network I/O on every job; ideally I would set up an ETL pipeline that reads only the diff from MongoDB and writes it into HDFS.
For obvious reasons, I would not like to copy the entire collection to HDFS periodically; I only want to copy the diff.
Any suggestions on how I can achieve this?
What I am thinking of right now is to tail the MongoDB oplog and apply those oplog entries to the previous DB snapshot (already saved in HDFS) to create the current MongoDB snapshot in HDFS.
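For the tailing part, this is roughly what I have in mind (pymongo-based; the staging sink and file path are just placeholders for whatever actually lands the diff in HDFS):

```python
import time
from bson import json_util
from pymongo import MongoClient, CursorType

client = MongoClient("mongodb://localhost:27017")  # assumption: a replica-set member, so the oplog exists
oplog = client.local.oplog.rs

def write_to_staging(entry):
    # Stand-in for the real sink; in practice this would append to an HDFS
    # staging area (e.g. via WebHDFS, or a local spool file that gets hdfs-put).
    with open("/tmp/oplog-staging.jsonl", "a") as f:
        f.write(json_util.dumps(entry) + "\n")

# Start from the newest oplog entry; this timestamp would need to be persisted
# somewhere durable so the tailer can resume after a restart.
last = oplog.find().sort("$natural", -1).limit(1).next()
last_ts = last["ts"]

while True:
    cursor = oplog.find({"ts": {"$gt": last_ts}},
                        cursor_type=CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        last_ts = entry["ts"]
        # entry["op"] is 'i'/'u'/'d', entry["ns"] is "db.collection",
        # and entry["o"] (plus "o2" for updates) carries the change itself.
        write_to_staging(entry)
    time.sleep(1)
```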
Is there anything better that I can do?
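For the merge step (applying the saved diffs to the previous snapshot), this is the kind of Spark job I am picturing, assuming the tailer flattens each change into one full JSON document per line with plain `_id`, `ts` and `op` fields, and that the snapshot is stored the same way; all paths and field names below are hypothetical:

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="mongo-oplog-merge")

SNAPSHOT_PATH = "hdfs:///mongo/snapshots/previous"   # hypothetical
DIFF_PATH = "hdfs:///mongo/oplog/batch-0001"         # hypothetical
OUTPUT_PATH = "hdfs:///mongo/snapshots/current"      # hypothetical

def keyed(line):
    doc = json.loads(line)
    # (_id, (timestamp, op, document)); snapshot rows default to timestamp 0
    return str(doc["_id"]), (doc.get("ts", 0), doc.get("op", "i"), doc)

snapshot = sc.textFile(SNAPSHOT_PATH).map(keyed)
diffs = sc.textFile(DIFF_PATH).map(keyed)

def newer(a, b):
    # keep whichever record carries the later oplog timestamp
    return a if a[0] >= b[0] else b

merged = (snapshot.union(diffs)
          .reduceByKey(newer)
          .filter(lambda kv: kv[1][1] != "d")   # drop documents whose last op was a delete
          .map(lambda kv: json.dumps(kv[1][2])))

merged.saveAsTextFile(OUTPUT_PATH)
```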