
I am investigating using Azure CosmosDB for an application that would require high read throughput, and the ability to scale. 99% of the activity would be reads, but occasionally we would need to insert somewhere from just a few documents to potentially a batch of a few million.

I have created a collection to test with and provisioned 2500 RU/sec. However, I am running into issues inserting even just 120 small (500-byte) documents (I get a "request rate is large" error).

How can I possibly use DocumentDB in any useful way if, any time I want to insert some documents, it will use all my RUs and prevent anyone from reading?

Yes, I can increase the RUs provisioned, but if I only need 2500 for reads, I don't want to have to pay for 10000 just for the occasional insert.

Reads need to be as fast as possible, ideally in the "single-digit-millisecond" range that Microsoft advertises. The inserts do not need to be as fast as possible, but faster is better.

I have tried using a stored procedure, which I have seen suggested, but that also fails to insert everything reliably. I have also tried creating my own bulk insert method using multiple threads, as suggested in the answer here, but this produces very slow results, often errors for at least some documents, and seems to average an RU rate well below what I've provisioned.
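To illustrate the kind of loop I mean, it is roughly like the following (sketched with the Python SDK purely for illustration; the endpoint, key, names, the "pk" partition key field and the document shape are placeholders, and the SDK already retries 429s a few times on its own before surfacing the error):

```python
# Illustration only: multi-threaded inserts with explicit handling of the
# 429 "Request rate is large" response. All names, the "pk" partition key
# field and the document shape are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

def insert_with_retry(doc, max_attempts=5):
    for _ in range(max_attempts):
        try:
            container.create_item(body=doc)
            return True
        except CosmosHttpResponseError as e:
            if e.status_code != 429:
                raise
            # Throttled: back off for the interval the service suggests, then retry.
            retry_ms = float(e.headers.get("x-ms-retry-after-ms", 1000))
            time.sleep(retry_ms / 1000.0)
    return False

docs = [{"id": str(i), "pk": str(i % 10), "payload": "x" * 400} for i in range(120)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(insert_with_retry, docs))
print(f"inserted {sum(results)} of {len(docs)} documents")
```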

I feel like I must be missing something. Do I have to massively over-provision RUs just for writes? Is there some kind of built-in functionality to limit the RU use for inserting? How is it possible to insert hundreds of thousands of documents in a reasonable amount of time, and without making the collection unusable?

QTom
    Without seeing your data schema or partitioning, hard to give anything definitive, but... you can try changing your indexing policy to lazy (from consistent), as well as changing your indexing policy to remove properties you don't need indexed. This should lower your per-document RU cost per insert (but I can't tell you how much it would save you). – David Makogon Aug 11 '17 at 10:22
  • @DavidMakogon Thanks, I may try that but it seems like offsetting the problem rather than solving it. I could do that and it might allow me to insert some documents, but next time I might need to insert more and have this issue again – QTom Aug 11 '17 at 10:34
    Like I said, I don't understand your overall data model. But... one more idea: since you only do occasional inserts, consider enabling per-minute RU burst, which gives you a 10x RU capacity, spread out over a per-minute time period. This might give you enough overhead to deal with inserts, and per-minute burst should be much more cost-efficient than a constant higher RU rate. – David Makogon Aug 11 '17 at 10:36
  • @DavidMakogon the thing is I will never really know exactly what/how much data is to be inserted, should I calculate the RU required and change it when inserting? Or is DocumentDB just not suitable unless you have a clear definition of how many RU you need? – QTom Aug 11 '17 at 11:05
  • @Tom Do you have partitioning enabled in your collection? Generally, RU/s configured at a high level are uniformly distributed across logical partitions, so if you are doing bulk inserts into a single partition that might be exhausting the provisioned RU/s. As David recommended, try to enable RU/minute and opt for eventual consistency, or disable indexing for keys which are not used in querying. If bulk insert operations are scheduled (like once a day), you can even try to increase RU/s before writing and bring them down once done with the operations over the collection. Let me know if that helps. – Surender Singh Malik Aug 11 '17 at 14:16
  • @SurenderSinghMalik Basically we would be using cosmos db to store data uploaded by clients, so we are unable to predict the size or schedule of the uploads. We currently use mongodb for this, but were interested in the read performance and scalability of cosmos, but it seems it might not be suitable for this kind of use – QTom Aug 11 '17 at 15:04
  • @Tom How about taking advantage of partitioning and using read regions for your read workload? For example, if you want to run with low RU/sec, you can always write to the write region and read from the read regions using the RU/sec of the read region. I think we can discuss this in detail and see what challenges you are facing while using Cosmos. – Surender Singh Malik Aug 14 '17 at 08:49
  • Did you check https://stackoverflow.com/questions/41744582/fastest-way-to-insert-100-000-records-into-documentdb?noredirect=1&lq=1 – Kiran Kolli Aug 15 '17 at 22:34
  • @KiranKolli I did, my takeaway was that to match our mongodb insert performance (10000 docs in <1 second) I need to provision ~50,000 RU which is ~$3000 a month... – QTom Aug 16 '17 at 10:04
  • You should use bulk write operations if you use the Mongo driver, and they should give a result for every record, whether success or failure. If you don't need upsert you can use insertMany, and that somehow works. Microsoft has a bug here, so it's best to submit and wait. The current situation makes this unusable. DynamoDB in AWS has this functionality. – Martin Kosicky Apr 17 '19 at 10:03
  • @QTom : Can you paste the code you are using to write into Cosmos? What are the fields/columns? What is your partition key and how is the data distributed across the logical partitions? 120 documents of .5KB is really small. You should look at optimizing the indexing. You can also look at using the rest API to temporarily increase the throughput and then bring it down. Don't increase beyond 10K if you have 1 physical partition, else cosmos will split your data into 2 physical partitions. – Anupam Chand Apr 19 '21 at 03:11

4 Answers


Performing bulk inserts of millions of documents is possible under certain circumstances. We just went through an exercise at my company of moving 100M records from various tables in an Azure SQL DB to CosmosDb.

  • It's very important to understand CosmosDb partitions. Choosing a good partition key that spreads your data out among partitions is critical to getting the kind of throughput you're looking for. Each physical partition has a maximum throughput of 10k RU/s. If you're trying to shove all of your data into a single partition, it doesn't matter how many RU/s you provision, because anything above 10k is wasted (assuming nothing else is going on in your container).
  • Also, each logical partition has a max size of 20GB. Once you hit 20GB in size, you'll get errors if you attempt to add more records. Yet another reason to choose your partition key wisely.
  • Use Bulk Insert. Here's a great video that offers a walkthrough. With the latest NuGet package, it's surprisingly easy to use. I found the video to be a much better explanation than what's on learn.microsoft.com. (See the sketch after this list.)
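The NuGet bulk support mentioned above is specific to the .NET SDK. If you're not on .NET, a rough equivalent of the same idea is to fan out many concurrent requests while spreading documents across partition key values; a sketch with the Python async client follows (names and the "pk" partition key field are placeholders):

```python
# Rough Python sketch of high-throughput ingestion: many concurrent requests
# via the async client, with documents spread across partition key values.
# Names and the "pk" partition key field are placeholders.
import asyncio

from azure.cosmos.aio import CosmosClient

ENDPOINT = "https://<account>.documents.azure.com:443/"
KEY = "<key>"

async def load(docs, concurrency=50):
    semaphore = asyncio.Semaphore(concurrency)  # cap the number of in-flight writes

    async with CosmosClient(ENDPOINT, credential=KEY) as client:
        container = client.get_database_client("mydb").get_container_client("mycoll")

        async def create(doc):
            async with semaphore:
                await container.create_item(body=doc)

        await asyncio.gather(*(create(d) for d in docs))

# Spreading ids across many partition key values avoids funnelling all writes
# into a single physical partition.
docs = [{"id": str(i), "pk": str(i % 1000)} for i in range(100_000)]
asyncio.run(load(docs))
```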

Edit: CosmosDb now has Autoscale. With Autoscale enabled, your collection will remain at a lower provisioned RU/s and will automatically scale up to a max threshold when under load. This will save you a ton of money with your specified use case. We've been using this feature since it went GA.

If the majority of your ops are reads, look into Integrated Cache. As of right now, it's in public preview. I haven't played with this, but it can save you money if your traffic is read-heavy.

Rob Reagan

The key to faster insertion is to distribute your load across multiple physical partitions. In your case, based on the total volume of data in the collection, you would have a minimum of (total volume / 10 GB) physical partitions. Your total RUs are equally distributed among these partitions.

Based on your data model, if you could partition your data, you could potentially gain speed by writing to different partitions in parallel.

Since you mentioned that you occasionally have to write a batch of a few million rows, I would advise increasing the RU capacity for that period and decreasing it back to the level required for your read load.
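A minimal sketch of that scale-up / load / scale-down pattern, assuming the Python SDK, manual (non-autoscale) throughput provisioned on the container, and a hypothetical bulk_insert routine (all names are placeholders):

```python
# Minimal sketch: temporarily raise container throughput for a bulk load, then
# restore the read-time baseline. Assumes manual (non-autoscale) throughput on
# the container; names are placeholders and bulk_insert is a hypothetical routine.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

baseline = container.get_throughput().offer_throughput  # e.g. 2500 RU/s for the read workload
try:
    # Staying at or below 10k RU/s avoids triggering a physical partition split.
    container.replace_throughput(10000)
    bulk_insert(container)  # hypothetical load routine, e.g. parallel writes across partition keys
finally:
    container.replace_throughput(baseline)  # scale back down for reads
```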

Writing using stored procedures, while saving on the network calls that you make, might not yield much benefit because a stored procedure can only execute against a single partition. So it can only use the RUs that are allocated to that partition.

https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data#designing-for-partitioning has some good guidance around what kind of partition makes sense.

KranthiKiran

If you can't improve the cost of your inserts, you might go the other way and slow down the process so that your overall performance is not impacted. If you look at the official performance benchmarking sample (which inserts documents), you could take it as an idea of how to limit the RU/s you require for inserts. It shows a lot of parameters that can be tweaked to improve performance, but those can obviously also be used to tailor your RU/s consumption to a certain level.
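As an illustration of the idea, here is a minimal sketch that budgets inserts to a fixed share of the provisioned RU/s (Python SDK; the 500 RU/s budget and names are placeholders, and reading the request charge via client_connection.last_response_headers is my assumption, not something taken from the benchmarking sample):

```python
# Minimal sketch of capping insert RU consumption so reads keep most of the
# provisioned throughput. The 500 RU/s budget and names are placeholders.
import time

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

RU_BUDGET_PER_SECOND = 500.0  # leave the remaining ~2000 RU/s for reads

def slow_insert(docs):
    window_start, spent = time.monotonic(), 0.0
    for doc in docs:
        container.create_item(body=doc)
        # Each response reports its cost in the x-ms-request-charge header.
        headers = container.client_connection.last_response_headers
        spent += float(headers.get("x-ms-request-charge", 0))
        if spent >= RU_BUDGET_PER_SECOND:
            # This second's budget is used up; wait for the next window.
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            window_start, spent = time.monotonic(), 0.0
```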

The answer by KranthiKiran pretty much sums up all other things I can think of.

Alex AIT

You could also use the new autopilot mode. Containers configured in autopilot mode adjust the capacity to meet the needs of the application's peak load and scale back down when the surge of activity is over. You need to specify the maximum throughput.

RCT