How do I delete duplicate documents based on a document attribute value? For example, the documents in a collection look like this:

    [
      {
        "ProductIdentifier": "A100",
        "ProductTitle": "Product A",
        "_ts": 1491664477
      },
      {
        "ProductIdentifier": "A100",
        "ProductTitle": "Product A",
        "_ts": 1491664466
      },
      {
        "ProductIdentifier": "B100",
        "ProductTitle": "Product B",
        "_ts": 1491664477
      }
    ]

I want to delete the second document because it is the same as the first document (based on ProductIdentifier) and has a lower timestamp (based on _ts).

There are quite a lot of such duplicate documents in the collection. What is an efficient way to delete them in bulk?

Alvin PL

1 Answer

It seems that you'd like to group your data and delete the duplicates from each group (documents with the same ProductIdentifier value and a lower timestamp). As far as I know, GroupBy is not currently supported in DocumentDB. However, you can load the documents, group them client-side with LINQ to get the ProductIdentifier of each group, and then query the documents with the same ProductIdentifier and delete the duplicates.

// Load the candidate documents into memory so they can be grouped client-side.
var collectionUri = UriFactory.CreateDocumentCollectionUri("testdb", "testcoll");
var query = client.CreateDocumentQuery<MyDoc>(collectionUri).Where(d => d.ProductIdentifier != "");

List<MyDoc> list1 = query.ToList();

// Collect the distinct (ProductIdentifier, ProductTitle) pairs.
var result = list1.GroupBy(item => new
{
    ProductIdentifier = item.ProductIdentifier,
    ProductTitle = item.ProductTitle
})
.Select(group => new
{
    ProductIdentifier = group.Key.ProductIdentifier,
    ProductTitle = group.Key.ProductTitle
});

foreach (var item in result)
{
    // Query the documents that belong to this group.
    var query1 = client.CreateDocumentQuery<MyDoc>(collectionUri).Where(d => d.ProductIdentifier == item.ProductIdentifier && d.ProductTitle == item.ProductTitle);

    // More than one match means the group contains duplicates.
    if (query1.Count() > 1)
    {
        // delete duplicates from the group: keep the document with the highest _ts and delete the rest
    }
}
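
The placeholder inside the loop leaves open which documents in a group should actually be deleted. One way to decide, shown here as a minimal in-memory sketch rather than a definitive implementation: group by ProductIdentifier, sort each group by timestamp in descending order, and treat everything after the first document as a duplicate. The MyDoc shape below (an Id property and a Timestamp property holding the _ts value) and the DuplicateFinder helper are assumptions for illustration, since the original MyDoc class is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for MyDoc; Id and Timestamp are assumed properties.
public class MyDoc
{
    public string Id { get; set; }
    public string ProductIdentifier { get; set; }
    public string ProductTitle { get; set; }
    public long Timestamp { get; set; } // value of the _ts attribute
}

public static class DuplicateFinder
{
    // Returns every document except the newest one in each ProductIdentifier group.
    public static List<MyDoc> FindDuplicates(IEnumerable<MyDoc> docs)
    {
        return docs
            .GroupBy(d => d.ProductIdentifier)
            // Newest first; Skip(1) keeps the newest and yields the older copies.
            .SelectMany(g => g.OrderByDescending(d => d.Timestamp).Skip(1))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // The three sample documents from the question.
        var docs = new List<MyDoc>
        {
            new MyDoc { Id = "1", ProductIdentifier = "A100", ProductTitle = "Product A", Timestamp = 1491664477 },
            new MyDoc { Id = "2", ProductIdentifier = "A100", ProductTitle = "Product A", Timestamp = 1491664466 },
            new MyDoc { Id = "3", ProductIdentifier = "B100", ProductTitle = "Product B", Timestamp = 1491664477 }
        };

        foreach (var dup in DuplicateFinder.FindDuplicates(docs))
        {
            // In a real collection, this is where the delete call would go.
            Console.WriteLine($"{dup.Id} {dup.ProductIdentifier} {dup.Timestamp}");
        }
        // prints "2 A100 1491664466"
    }
}
```

Each document returned by FindDuplicates would then be removed with the SDK's delete call (for example client.DeleteDocumentAsync with the document's link). That call is left out here so the sketch runs without a DocumentDB account. If two documents in a group share the same _ts, the one that appears first in the input is kept.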

Besides, as Larry Maccherone said in this thread, documentdb-lumenize is an aggregation library for DocumentDB written as a stored procedure, which can help us perform GroupBy.

string configString = @"{
    cubeConfig: {
        groupBy: 'ProductIdentifier', 
        field: '_ts', 
        f: 'max'
    }, 
    filterQuery: 'SELECT * FROM c'
}";
Object config = JsonConvert.DeserializeObject<Object>(configString);
dynamic result = await client.ExecuteStoredProcedureAsync<dynamic>(UriFactory.CreateStoredProcedureUri("testdb", "testcoll", "cube"), config);
// get the group info from result.Response
Fei Han