
I want to move my MongoDB production data to HDFS through a continuous ETL pipeline so that I can run Spark/MR jobs on it.

I know the MongoDB Connector for Hadoop exists to read/write data from/to MongoDB, but I don't want every job to incur that network I/O; ideally I would set up an ETL pipeline that reads only the diff from MongoDB and writes it into HDFS.
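
For context, this is roughly what reading through the connector looks like from PySpark (the URI and namespace below are placeholders); every job run re-scans the collection over the network, which is exactly the cost I want to avoid:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-full-scan")

# Placeholder URI: the connector streams the entire collection over the
# network into the job on every run.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://localhost:27017/mydb.mycollection"},
)
print(rdd.count())
```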

For obvious reasons, I would not like to copy the entire collection to HDFS periodically; I only want to copy the diff.

Any suggestions on how I can achieve this?

What I am thinking of right now is to read the MongoDB oplog and apply those operations to the previous DB snapshot (already saved in HDFS) to produce the current MongoDB snapshot in HDFS.
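
To make the idea concrete, here is a minimal sketch of the tailer side, assuming a replica set (the oplog only exists on replica-set members) and placeholder namespace/paths; a separate periodic merge job would fold the staged operations into the HDFS snapshot:

```python
import time

import pymongo
from bson.json_util import dumps

# Placeholder connection details and namespace; adjust for the deployment.
client = pymongo.MongoClient("mongodb://localhost:27017")
oplog = client.local.oplog.rs  # only exists on replica-set members

# Start from the newest entry; this timestamp should be persisted somewhere
# durable (e.g. a small file in HDFS) so the tailer can resume after a restart.
last_ts = oplog.find_one(sort=[("$natural", pymongo.DESCENDING)])["ts"]

while True:
    # Tailable cursor: blocks waiting for new oplog entries instead of
    # terminating when it reaches the end of the collection.
    cursor = oplog.find(
        {"ts": {"$gt": last_ts}, "ns": "mydb.mycollection"},
        cursor_type=pymongo.CursorType.TAILABLE_AWAIT,
    )
    for op in cursor:
        last_ts = op["ts"]
        # op["op"] is "i" (insert), "u" (update) or "d" (delete); append it
        # as a JSON line to a staging file that the periodic merge job
        # applies to the previous snapshot.
        with open("/tmp/oplog_staging.jsonl", "a") as f:
            f.write(dumps(op) + "\n")
    # Cursor died (oplog rolled over or the connection dropped); retry.
    time.sleep(1)
```

The one thing this sketch glosses over is oplog rollover: if the tailer falls behind by more than the oplog window, the diff is lost and a full re-sync of the collection is unavoidable.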

Is there anything better I can do?
