
My organisation uses MongoDB to store application time-series data, and we are now trying to build a data pipeline for analytics and visualisation. Since the data is time-series, we plan to use Druid as intermediate storage where we can do the required transformations, and then use Apache Superset for visualisation. Is there any way to migrate the required data (not only updates) from MongoDB to Druid?

I was thinking about Apache Kafka, but from what I have read, I understood that it works best for streaming changes through topics (topics associated with collections/tables) that already exist in both MongoDB and Druid. But what if there is a collection of at least 100,000 records that exists only in MongoDB, and I first want to push the whole collection to Druid? Will Kafka work in this scenario?
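One way the Kafka route can cover the existing 100,000 records, and not only new changes, is Debezium's MongoDB connector (suggested in the comments below): it takes an initial snapshot of the existing collections before streaming further changes from the oplog. A minimal, untested sketch of registering such a connector through the Kafka Connect REST API, with placeholder hosts, ports and collection names (property names vary between Debezium versions):

```python
# Rough sketch: register a Debezium MongoDB source connector with Kafka Connect.
# Hostnames, ports, the replica set name and the collection are placeholders.
import json

import requests

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST API

connector = {
    "name": "mongodb-timeseries-source",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        # Debezium reads the oplog, so MongoDB must run as a replica set.
        "mongodb.hosts": "rs0/mongodb-host:27017",
        "mongodb.name": "app",                    # logical name, used as the topic prefix
        "collection.whitelist": "mydb.metrics",   # property name differs in newer Debezium versions
        # "initial" snapshots the existing documents before streaming changes.
        "snapshot.mode": "initial",
    },
}

resp = requests.post(CONNECT_URL, json=connector)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```

Druid's Kafka indexing service can then be pointed at the resulting topic (named roughly logicalName.database.collection, e.g. app.mydb.metrics here) to ingest both the snapshot and the subsequent changes; the exact supervisor spec depends on the Druid version.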

hemant A
  • Have you tried using Debezium to get data out of Mongo and into Kafka? – OneCricketeer Sep 17 '19 at 03:33
  • @cricket_007: I tried using it last week but I was getting an "out of memory" error. I tried to solve it but was not able to, and due to time constraints I could not spend more time on it. So currently I am trying to develop a Python script to migrate the data from MongoDB to Druid (see the sketch after these comments). – hemant A Sep 18 '19 at 05:53
  • You could've increased the heap size to fix that error... Writing your own consumer is brittle because managing offsets and failure conditions can be difficult – OneCricketeer Sep 18 '19 at 07:40
  • @cricket_007 I agree that writing a consumer from scratch is not the most efficient way. I will try using Kafka again with an increased heap size. There was also a "timed out" error, for which I raised a separate question that I think you also edited. – hemant A Sep 19 '19 at 07:03
  • By default, Kafka only exposes its ports locally. You'll have to edit its properties to make it accept external connections – OneCricketeer Sep 19 '19 at 12:44
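A minimal sketch of the hand-written Python migration script mentioned in the comments, assuming pymongo is installed and using placeholder connection details, database and collection names:

```python
# Sketch: one-off bulk export from MongoDB to newline-delimited JSON that
# Druid can ingest with a native batch task. All names here are placeholders.
import json

from pymongo import MongoClient

client = MongoClient("mongodb://mongodb-host:27017")
collection = client["mydb"]["metrics"]

# Stream the whole collection (100,000+ documents) without loading it into memory.
with open("/tmp/metrics-export.json", "w") as out:
    for doc in collection.find({}, batch_size=1000):
        doc["_id"] = str(doc["_id"])               # ObjectId is not JSON-serialisable
        out.write(json.dumps(doc, default=str) + "\n")
```

The exported file can then be loaded with Druid's native batch ingestion, for example by POSTing an ingestion spec to the Overlord's /druid/indexer/v1/task endpoint or through the web console's data loader; the exact spec format depends on the Druid version.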

0 Answers