
I've tried many different combinations of settings in Azure Data Factory to create a clone of a CosmosDB collection that maintains the order in which items were written to a partition, but unless I specify a batch write size of 1, it does not keep the order. Even triggering from the Change Feed of the source in a mapping data flow does not preserve order. We have written a simple tool that copies one record at a time, but obviously that is slow.

We are using Cosmos as an event store, and the change feed processor feeds our projectors - it all works really well, but we would like to copy the events out to a different environment to test changes. This requires the original write order to be preserved.
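For reference, this is roughly what our record-at-a-time copier does (a minimal sketch using the Python azure-cosmos SDK's change feed pull model; the account URLs, keys, and container names are placeholders):

    from azure.cosmos import CosmosClient

    # Placeholder endpoints, keys, and names for illustration only.
    source = CosmosClient("https://source-account.documents.azure.com:443/", credential="<source-key>")
    target = CosmosClient("https://target-account.documents.azure.com:443/", credential="<target-key>")

    src = source.get_database_client("eventstore").get_container_client("events")
    dst = target.get_database_client("eventstore").get_container_client("events")

    # Pull the change feed from the beginning and copy one document at a
    # time. Writing with an effective batch size of 1 is what preserves
    # the original write order per partition - and also why this is slow.
    for doc in src.query_items_change_feed(is_start_from_beginning=True):
        dst.upsert_item(doc)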

Thanks in advance.

Naeem Khoshnevis
Darren Hall

2 Answers


The change feed processor does read from each physical partition in _ts order.

I've certainly been able to use this to successfully copy very large collections (> 1 TB) in a matter of a few hours.

For this I've used a function app scaled across multiple instances, ensured the leases collection has sufficient max RU configured so that it does not become a bottleneck, and, when provisioning the target, scaled the RU up enough to create the desired number of physical partitions up front rather than having partitions split during the import (see the sketch below).
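As a sketch of that last point (Python azure-cosmos SDK; the throughput figure and all names here are illustrative), provisioning the target with a high enough RU/s makes Cosmos create the physical partitions at the start:

    from azure.cosmos import CosmosClient, PartitionKey

    # Placeholder endpoint, key, and names for illustration only.
    client = CosmosClient("https://target-account.documents.azure.com:443/", credential="<key>")
    db = client.get_database_client("eventstore")

    # Provisioning e.g. 50,000 RU/s up front makes Cosmos create several
    # physical partitions immediately (roughly one per 10,000 RU/s), so
    # they don't need to split mid-import; scale back down afterwards.
    db.create_container(
        id="events-copy",
        partition_key=PartitionKey(path="/partitionKey"),
        offer_throughput=50000,
    )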

I have always used bulk insert, though, so within each batch delivered by the change feed processor I guess the _ts order could be lost. This has never been important for me.
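If intra-batch order did matter, the handler could write sequentially instead of bulk inserting. A rough sketch (again Python azure-cosmos; handle_changes is a hypothetical handler whose batch argument mirrors what the change feed processor delivers):

    from azure.cosmos import CosmosClient

    # Placeholder endpoint, key, and names for illustration only.
    client = CosmosClient("https://target-account.documents.azure.com:443/", credential="<key>")
    dst = client.get_database_client("eventstore").get_container_client("events-copy")

    def handle_changes(batch):
        """Copy one change feed batch, preserving the source write order.

        Each batch arrives in _ts order per physical partition; sorting
        defensively and writing one document at a time keeps that order,
        at the cost of the throughput that bulk insert would give you.
        """
        for doc in sorted(batch, key=lambda d: d["_ts"]):
            dst.upsert_item(doc)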

The most efficient way of copying the collection to a new one and preserving the _ts order would certainly be to restore a backup.

It also has the benefit that you do not have to write any code or provision any resources to do it. If you are not already using the continuous backup model, you should consider switching to it, as this allows the restore to be self-service and to a specified point in time.
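With continuous backup enabled, a point-in-time restore can be triggered from the Azure CLI; for example (resource names and the timestamp below are placeholders):

    az cosmosdb restore \
        --resource-group my-resource-group \
        --account-name source-account \
        --target-database-account-name restored-account \
        --restore-timestamp 2022-01-10T10:00:00+0000 \
        --location westeurope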

Martin Smith
  • Thanks Martin, I think switching to continuous backup is good advice regardless. As you suspected above, the use of bulk insert (either manually or by utilising Data Factory) means that we cannot guarantee write order per physical partition. Seems like my only option may be to trickle feed my other environments in a similar way to you but with single writes. – Darren Hall Jan 10 '22 at 10:36

Get a tool like Cerebrata; it will copy between collections etc. as you see fit. If you are doing a lot of Azure work, especially with CosmosDB, it is a very handy tool to use. I could not live without it these days.

Disclaimer: I do NOT work for Cerebrata, nor do I receive any benefit for recommending their tools; this is purely based on my own experience.

Matt Douhan