0

I have 2 collections (A and B) with ~70,000 documents each. If I were to compare A and B, 95% of documents would be the same, only 5% would be different. Structure of each document is completely identical in both A and B. A is a constant collection, B is a temporary collection. I want to merge B into A. If document from B exists in A --> update "dateLastSeen" field only. If document in B does NOT exist in A --> insert this document into A.

...I'm using a Python driver (if that matters).

What is the most efficient way of doing this? Thank you.

Victor S
  • 839
  • 1
  • 6
  • 3

1 Answers1

0

The most efficient in terms of queries would be to bulk update all the dates that need to be updated in one go per date and bulk insert all those documents that need inserting.

Given you have 95% the same documents where you want to update A.dateLastSeen to be B.dateLastSeen. With single updates that would be: ~66,500 updates. Leaving ~3,500 inserts.

Loading all B and A in memory - then processing is one possibility.

You could create a bulk insert list and append anytime a doc from B is missing from A. Also a bulk update dictionary keyed by dateLastSeen containing a list of Documents to update. Depends on the probability of any matching dateLastSeen values to see if this is really worth it.

Alternatively, simplify it an accept the high query cost and start processing B in batches of 1000, load the equivalent 1000 in A and compare and update / bulk insert. Keeps the memory print down and adds only ~210 extra queries in total for fetching the batches of data (~70 batch fetches from B, ~70 from A and ~70 bulk inserts).

Ross
  • 17,861
  • 2
  • 55
  • 73