I want to move my MongoDB production data to HDFS through a continuous ETL pipeline so that I can run Spark/MR jobs on it.
I know that the MongoDB Connector for Hadoop exists to read/write data from/to MongoDB, but I don't want to incur that network I/O on every job; ideally I would set up an ETL pipeline that reads only the diff from MongoDB and writes it into HDFS.
For obvious reasons, I would not like to copy the entire collection to HDFS periodically; I only want to copy the diff.
Any suggestions on how I can achieve this?
What I am thinking of right now is to tail the MongoDB oplog and apply those oplog entries to the previous DB snapshot (already saved in HDFS) to create the current MongoDB snapshot in HDFS.
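For the tailing part, this is roughly what I have in mind (pymongo-based; the staging sink and file path are just placeholders for whatever actually lands the diff in HDFS):

```python
import time
from bson import json_util
from pymongo import MongoClient, CursorType

client = MongoClient("mongodb://localhost:27017")  # assumption: a replica-set member, so the oplog exists
oplog = client.local.oplog.rs

def write_to_staging(entry):
    # Stand-in for the real sink; in practice this would append to an HDFS
    # staging area (e.g. via WebHDFS, or a local spool file that gets hdfs-put).
    with open("/tmp/oplog-staging.jsonl", "a") as f:
        f.write(json_util.dumps(entry) + "\n")

# Start from the newest oplog entry; this timestamp would need to be persisted
# somewhere durable so the tailer can resume after a restart.
last = oplog.find().sort("$natural", -1).limit(1).next()
last_ts = last["ts"]

while True:
    cursor = oplog.find({"ts": {"$gt": last_ts}},
                        cursor_type=CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        last_ts = entry["ts"]
        # entry["op"] is 'i'/'u'/'d', entry["ns"] is "db.collection",
        # and entry["o"] (plus "o2" for updates) carries the change itself.
        write_to_staging(entry)
    time.sleep(1)
```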
Is there anything better that I can do?
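For the merge step (applying the saved diffs to the previous snapshot), this is the kind of Spark job I am picturing, assuming the tailer flattens each change into one full JSON document per line with plain `_id`, `ts` and `op` fields, and that the snapshot is stored the same way; all paths and field names below are hypothetical:

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="mongo-oplog-merge")

SNAPSHOT_PATH = "hdfs:///mongo/snapshots/previous"   # hypothetical
DIFF_PATH = "hdfs:///mongo/oplog/batch-0001"         # hypothetical
OUTPUT_PATH = "hdfs:///mongo/snapshots/current"      # hypothetical

def keyed(line):
    doc = json.loads(line)
    # (_id, (timestamp, op, document)); snapshot rows default to timestamp 0
    return str(doc["_id"]), (doc.get("ts", 0), doc.get("op", "i"), doc)

snapshot = sc.textFile(SNAPSHOT_PATH).map(keyed)
diffs = sc.textFile(DIFF_PATH).map(keyed)

def newer(a, b):
    # keep whichever record carries the later oplog timestamp
    return a if a[0] >= b[0] else b

merged = (snapshot.union(diffs)
          .reduceByKey(newer)
          .filter(lambda kv: kv[1][1] != "d")   # drop documents whose last op was a delete
          .map(lambda kv: json.dumps(kv[1][2])))

merged.saveAsTextFile(OUTPUT_PATH)
```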