
I am working on a project where I need to migrate MongoDB fields to Google Cloud Storage. There are a total of 5 million documents in MongoDB, from which I need to transfer some fields to GC storage. I want to know whether there is an efficient way to read, e.g., the first 100 records and transfer them to GC, then read the next 100 records, and so on. I know MongoTemplate has findAll with pagination, but after researching I found that it's not a good way to do this. Is there any kind of item reader for MongoDB?

Shrey Soni

1 Answer


There are many ways to do this.

First of all, you need to consider the options for storing your documents in GCS...

E.g. Parquet (columnar, for analytics), Avro (row based), multiple JSON files, ... and how you want to partition the data into files.

If you are planning on saving all the documents into a single file, you can't distribute this process, whereas if you are planning on partitioning the data, you can distribute the work by partition.

Second of all, you need to consider the source documents' structure... In order to read the data in bulks, you need to sort by a unique column (e.g. _id) and use skip & take (skip & limit in MongoDB terms) to paginate it. If you are partitioning the data, you need to either paginate per partition, or divide the data into partitions small enough that you can read each partition in one go.
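For example, with the plain MongoDB Java driver, reading in batches of 100 by sorting on _id and using skip & limit could look roughly like this. This is a minimal sketch: the connection string, database, collection and page size are placeholder assumptions, and the GCS upload is left as a stub.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BatchReader {
    public static void main(String[] args) {
        // Placeholder connection details — adjust to your environment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("documents");

            int pageSize = 100;
            int page = 0;
            while (true) {
                // Sort on the unique _id field, then skip past the pages already read.
                List<Document> batch = collection.find()
                        .sort(new Document("_id", 1))
                        .skip(page * pageSize)
                        .limit(pageSize)
                        .into(new ArrayList<>());
                if (batch.isEmpty()) {
                    break; // no more documents
                }
                // TODO: extract the required fields from each document
                // and upload them to GCS here.
                page++;
            }
        }
    }
}
```

Note that skip gets slower as the offset grows; filtering on a range of _id values (keyset pagination) avoids that, but the sketch above mirrors the skip & take approach described here.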

After you answer these questions, you can choose a suitable technology for either serial work or distributed work (e.g. Spark for distributed).

Danny Varod
  • This migration is a one-time job. I have the following big fields in the Mongo document: plaincontent, htmlcontent, attachments, which need to be migrated to a Google Cloud bucket. – Shrey Soni Jan 18 '22 at 13:06
  • This doesn't answer any of the questions I've posed. Read my answer again, and if it doesn't give you direction, perhaps find a consultant to help you. – Danny Varod Jan 18 '22 at 16:57
  • Hi @Danny Varod, first of all thank you for sharing the solution. I'm new to this, so a more detailed explanation would be nice. I currently use Spring Batch, which fetches records from MongoDB and uploads them to GC, and then I need to update the GC URI on that document, which we will be using in future (see the sketch after these comments). – Shrey Soni Jan 19 '22 at 06:07
  • So one file per document? Anyway, there are a lot of details to consider and little detail on what you are attempting to achieve. As it is, this question is not answerable. It seems like you need a consultant (a software engineer experienced in big data) to interview you and then guide you on options and on a specific solution matching your requirements and preferences (it will probably take a few hours of meetings, and a few more if you want implementation). Try searching for a local consultant. – Danny Varod Jan 19 '22 at 12:14
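For reference, the flow described in the comments (upload a field's content to a GCS bucket, then write the resulting object URI back onto the Mongo document) could look roughly like this with the google-cloud-storage client and the MongoDB Java driver. This is a sketch only: the bucket name, object-name convention, and the URI field name are placeholder assumptions.

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.nio.charset.StandardCharsets;

public class GcsFieldMigrator {
    private final Storage storage = StorageOptions.getDefaultInstance().getService();
    private final String bucket = "my-migration-bucket"; // placeholder bucket name

    /** Uploads one field of the document to GCS and stores the object URI back on the document. */
    public void migrateField(MongoCollection<Document> collection, Document doc, String field) {
        String content = doc.getString(field); // e.g. "plaincontent" or "htmlcontent"
        if (content == null) {
            return;
        }
        // Object name derived from the document id and field name (an arbitrary convention).
        String objectName = doc.getObjectId("_id").toHexString() + "/" + field;
        BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucket, objectName))
                .setContentType("text/plain")
                .build();
        storage.create(blobInfo, content.getBytes(StandardCharsets.UTF_8));

        // Write the GCS URI back onto the source document (the target field name is an assumption).
        String uri = "gs://" + bucket + "/" + objectName;
        collection.updateOne(
                Filters.eq("_id", doc.getObjectId("_id")),
                Updates.set(field + "GcsUri", uri));
    }
}
```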