Deleting Millions of Documents in cosmos in a reasonable amount of time

Question

Recently I have been working a lot with Cosmos and ran in to an issue when looking at deleting documents.

I need to delete around ~40 million documents in my Cosmos Container, I've looked around quite a bit and found a few options of which i have tried. two of the fastest of which I've tried are using a stored procedure within cosmos to delete records and using a bulk executor.

Both of these options have given subpar results compared to what I am looking for. I believe this should be obtainable within a couple hours but at the moment I am getting performance of around 1 hour per million recordsT

the two methods I used can also be seen here:

Stack Overflow Post on Document Deletion

My documents are about 35 keys long where half are string values and the other half are float/integer values, if that matters, and there are around 100k records per partition.

Here is are the two examples that I am using to attempt the deletion:

This first one is using C# and the documentation that helped me with this is here:

GitHub Documentation azure-cosmosdb-bulkexecutor-dotnet-getting-started

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.CosmosDB.BulkExecutor;
using Microsoft.Azure.CosmosDB.BulkExecutor.BulkImport;
using Microsoft.Azure.CosmosDB.BulkExecutor.BulkDelete;

namespace BulkDeleteSample
{
    class Program
    {

        private static readonly string EndpointUrl = "xxxx";
        private static readonly string AuthorizationKey = "xxxx";
        private static readonly string DatabaseName = "xxxx";
        private static readonly string CollectionName = "xxxx";


        static ConnectionPolicy connectionPolicy = new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct,
            ConnectionProtocol = Protocol.Tcp
        };


        static async Task Main(string[] args)
        {

            DocumentClient client = new DocumentClient(new Uri(EndpointUrl), AuthorizationKey, connectionPolicy);
            DocumentCollection dataCollection = GetCollectionIfExists(client, DatabaseName, CollectionName);

            // Set retry options high during initialization (default values).
            client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;
            client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;

            BulkExecutor bulkExecutor = new BulkExecutor(client, dataCollection);
            await bulkExecutor.InitializeAsync();

            // Set retries to 0 to pass complete control to bulk executor.
            client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
            client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

            List<Tuple<string, string>> pkIdTuplesToDelete = new List<Tuple<string, string>>();

            for (int i = 0; i < 99999; i++)
            {
                pkIdTuplesToDelete.Add(new Tuple<string, string>("1", i.ToString()));
            }

            BulkDeleteResponse bulkDeleteResponse = await bulkExecutor.BulkDeleteAsync(pkIdTuplesToDelete);

        }

        static DocumentCollection GetCollectionIfExists(DocumentClient client, string databaseName, string collectionName)
        {

            return client.CreateDocumentCollectionQuery(UriFactory.CreateDatabaseUri(databaseName))
                .Where(c => c.Id == collectionName).AsEnumerable().FirstOrDefault();
        }

    }
}

The second one is using a stored procedure I found which delete data from a given partition using a query, of which I am running via a python notebook.

Here is the stored procedure:

/**
 * A Cosmos DB stored procedure that bulk deletes documents for a given query.
 * Note: You may need to execute this sproc multiple times (depending whether the sproc is able to delete every document within the execution timeout limit).
 *
 * @function
 * @param {string} query - A query that provides the documents to be deleted (e.g. "SELECT c._self FROM c WHERE c.founded_year = 2008"). Note: For best performance, reduce the # of properties returned per document in the query to only what's required (e.g. prefer SELECT c._self over SELECT * )
 * @returns {Object.<number, boolean>} Returns an object with the two properties:
 *   deleted - contains a count of documents deleted
 *   continuation - a boolean whether you should execute the sproc again (true if there are more documents to delete; false otherwise).
 */
function bulkDeleteSproc(query) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();
    var response = getContext().getResponse();
    var responseBody = {
        deleted: 0,
        continuation: true
    };

    // Validate input.
    if (!query) throw new Error("The query is undefined or null.");

    tryQueryAndDelete();

    // Recursively runs the query w/ support for continuation tokens.
    // Calls tryDelete(documents) as soon as the query returns documents.
    function tryQueryAndDelete(continuation) {
        var requestOptions = {continuation: continuation};

        var isAccepted = collection.queryDocuments(collectionLink, query, requestOptions, function (err, retrievedDocs, responseOptions) {
            if (err) throw err;

            if (retrievedDocs.length > 0) {
                // Begin deleting documents as soon as documents are returned form the query results.
                // tryDelete() resumes querying after deleting; no need to page through continuation tokens.
                //  - this is to prioritize writes over reads given timeout constraints.
                tryDelete(retrievedDocs);
            } else if (responseOptions.continuation) {
                // Else if the query came back empty, but with a continuation token; repeat the query w/ the token.
                tryQueryAndDelete(responseOptions.continuation);
            } else {
                // Else if there are no more documents and no continuation token - we are finished deleting documents.
                responseBody.continuation = false;
                response.setBody(responseBody);
            }
        });

        // If we hit execution bounds - return continuation: true.
        if (!isAccepted) {
            response.setBody(responseBody);
        }
    }

    // Recursively deletes documents passed in as an array argument.
    // Attempts to query for more on empty array.
    function tryDelete(documents) {
        if (documents.length > 0) {
            // Delete the first document in the array.
            var isAccepted = collection.deleteDocument(documents[0]._self, {}, function (err, responseOptions) {
                if (err) throw err;

                responseBody.deleted++;
                documents.shift();
                // Delete the next document in the array.
                tryDelete(documents);
            });

            // If we hit execution bounds - return continuation: true.
            if (!isAccepted) {
                response.setBody(responseBody);
            }
        } else {
            // If the document array is empty, query for more documents.
            tryQueryAndDelete();
        }
    }
}

I'm not sure if I am doing anything wrong or it the performance just isn't there with cosmos but I'm finding it quite difficult to achieve what I'm looking for, any advice is greatly appreciated.

Have you tried setting the TTL to zero instead of deleting them immediately? — Stephen Cleary, Aug 12 '21 at 22:12
would updating the documents be much faster than deleting them? i can try this but I'm skeptical of the performance gains. — toubi, Aug 12 '21 at 23:57
There is no bulk delete, and a stored procedure will only delete within a logical partition. Perhaps it's more efficient to preserve data you want to keep, into another collection (via changefeed or some other mechanism), and then delete the original collection? — David Makogon, Aug 13 '21 at 01:04
@toubi: I think it would depend. At the least, it wouldn't have to update indexes as part of that call. — Stephen Cleary, Aug 13 '21 at 01:41
If you want to delete entire partitions, this feature is in preview. Might be worth getting involved to try it for your case: https://aka.ms/cosmosPkDeleteSignup — Noah Stahl, Aug 14 '21 at 14:38
Thanks ill take a look into this preview, see if it can help me out here — toubi, Aug 14 '21 at 23:24

Deleting Millions of Documents in cosmos in a reasonable amount of time

0 Answers0