
I have a huge transactions table in an Azure database, where we import files with over 1 million objects.

public class Transaction
{
    [Key]
    public int Id { get; set; }
    public int TransactionId { get; set; }
    public DateTime Date { get; set; }
    public decimal Price { get; set; }
    public int UserId { get; set; }
    public string Product { get; set; }
    public int ClientId { get; set; }
    public int Uploaed { get; set; }
    public string UniqueId { get; set; }
    public string Custom1 { get; set; }
    public string Custom2 { get; set; }
    public string Custom3 { get; set; }
}

After importing all the new data, I take all the new transaction IDs and fetch all the transaction IDs for that client from the database.

        // ids from import
        string transactionsString = string.Join(",", transactionIdsCsv);
        var result = await _transactionsDataRepository.GetByTransactionIdsAndClientId(transactionIdsCsv.ToArray(), clientId);
        // ids from repository
        string transactionsDBString = string.Join(",", result.ToList());

        // remove rows in db where transaction ids are duplicated and clientId = ClientId

But I am struggling to find the most effective way to do this. I wanted to do something like `DELETE FROM transactions WHERE transactionId IN (transactionsDBString) AND clientId = @clientId`, but that would delete both rows, and I only want the new row to be deleted (the old row should stay).
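Roughly what I had in mind (a sketch; I'm using Dapper in my .NET Core application, and `cn` stands for my DbConnection):

// Sketch of the naive idea - but this deletes BOTH the old and the new
// row for every duplicated TransactionId, not just the new one:
await cn.ExecuteAsync(
    "DELETE FROM Transactions WHERE TransactionId IN @ids AND ClientId = @clientId",
    new { ids = result, clientId });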

But would that be a good approach? Even fetching `var result = await _transactionsDataRepository...` can take a long time, since there are millions of rows.


1 Answer


I only want the new row to be deleted (the old row should stay)

Since you already know how to identify the transaction IDs you want to delete, you could delete the necessary rows while keeping the oldest like so (you didn't mention it, but I'm assuming you're using Entity Framework, given your use of the `[Key]` attribute - correct me if I'm wrong):

// transIds = the TransactionIds from your import (transactionIdsCsv in your example)
var transToRemove = dbContext.Transactions
    .Where(t => t.ClientId == clientId && transIds.Contains(t.TransactionId))
    .GroupBy(t => t.TransactionId) // Group transactions with the same TransactionId
    .SelectMany(
        group => group.OrderBy(t => t.Date) // Order the oldest first
                     .Skip(1) // Skip the oldest (we want to keep it)
    );

dbContext.Transactions.RemoveRange(transToRemove);
dbContext.SaveChanges();

Edit: Included an example that should work for Dapper...

var cn = /* create your DbConnection, e.g. new SqlConnection(connectionString) */;

// This query should select all transactions you want to delete excluding
// those with the oldest Date. This is just like 'transToRemove' above
var selectQuery = @"
SELECT t1.Id FROM Transactions t1
INNER JOIN (
    SELECT
        MIN(tInner.Date) AS FirstTransDate,
        tInner.TransactionId,
        tInner.ClientId
    FROM Transactions tInner
    WHERE tInner.ClientId = @clientId
        AND tInner.TransactionId IN @transIds
    GROUP BY tInner.TransactionId, tInner.ClientId
) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
WHERE t1.Date != t2.FirstTransDate
";
var idsToDelete = cn.Query<int>(
    selectQuery,
    new { clientId, transIds }).ToList();

// Delete the whole list in one go
cn.Execute("DELETE FROM Transactions WHERE Id IN @idsToDelete", new { idsToDelete });


I haven't tested this using Dapper, but the list of `idsToDelete` should be correct according to this fiddle I made. A couple of things to note:

  1. Depending on how long your list of transIds is (I believe those IDs are in `result` in your own example) you might want to repeat this in smaller batches instead of trying to delete the whole list in one go. SQL Server only allows roughly 2,100 parameters per query, and Dapper expands each item in the list into its own parameter, so a very long list will fail outright; see the sketch after this list.
  2. The SQL query above doesn't account for two "duplicate" transactions sharing the same oldest Date. If that can happen in your table, this query will keep both of those rows and only remove the remaining duplicates.
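A minimal sketch of such batching (untested; the batch size of 1,000 is an illustrative number that keeps each query well below SQL Server's parameter limit):

// Delete in chunks instead of all at once. Dapper expands each list
// element into its own SQL parameter, and SQL Server caps a query at
// about 2,100 parameters, so large lists must be split up anyway.
const int batchSize = 1000;
for (var i = 0; i < idsToDelete.Count; i += batchSize)
{
    var batch = idsToDelete.Skip(i).Take(batchSize).ToList();
    cn.Execute("DELETE FROM Transactions WHERE Id IN @batch", new { batch });
}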

Improvements

There are a couple of things that seem a little out of place with your setup that I think you should consider:

Even fetching `var result = await _transactionsDataRepository...` can take a long time, since there are millions of rows

Millions of rows should not be an issue for any decent database server to handle. It sounds like you are missing some indexes on your table. With proper indexes your queries should be pretty swift, as long as you keep them simple.
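For example, a composite index matching the WHERE clause of the queries above would let the database seek directly to the relevant rows instead of scanning the whole table (a sketch; the index name and the included column are my suggestion, not something from your setup):

// Run once (e.g. as part of a migration), not on every import:
cn.Execute(@"
    CREATE INDEX IX_Transactions_ClientId_TransactionId
    ON Transactions (ClientId, TransactionId)
    INCLUDE (Date);");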

But would that be a good approach?

I'm not quite sure what you are referring to as good or bad here, but I'll interpret a little... Right now you are writing tons of rows to a table that seems to contain duplicate data. When I think of a transaction-based system, no two transactions should share the same ID. That means for two different ClientIds there should never be a case where t1.TransactionId == t2.TransactionId. If that holds, you could avoid checking ClientId in my code snippet above.

Since you only want to keep one transaction for each TransactionId, will you ever need two transactions with the same TransactionId? If not, you can go even further: make the TransactionId column unique and avoid inserting two rows with the same TransactionId in the first place. You can use Entity Framework's `[Index(IsUnique = true)]` attribute to also create an index that speeds up queries on that column/property.
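A sketch of what that could look like with EF6's `IndexAttribute` (in EF Core 5+ the equivalent is an `[Index]` attribute placed on the class instead):

public class Transaction
{
    [Key]
    public int Id { get; set; }

    // Creates a unique index: inserting a second row with the same
    // TransactionId now fails instead of silently adding a duplicate
    [Index(IsUnique = true)]
    public int TransactionId { get; set; }

    // ... remaining properties as in your model
}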

  • Does your method stream the `transToRemove` to the application and delete them one at a time, or does it send a query and delete them DB-side? (I believe it's the former) – Jamie Twells Jun 02 '20 at 21:16
  • Thanks so much for the answer! I'm using Dapper in my .NET Core application, not a DbContext. Also, I cannot omit the clientId check, since different clients can import files with any IDs and they could be the same. (Id however cannot clash, since it's auto-increment.) I'm not sure if it's an issue with my Azure database or missing indexes, but I needed to fetch 11 million IDs for my result list and it took ages – user122222 Jun 02 '20 at 21:36
  • @JamieTwells I believe it gets loaded into the application's memory before the delete query is run on the SQL server. [See here](https://stackoverflow.com/a/48443963/3034273) how you can improve that. – Xerillio Jun 02 '20 at 22:18
  • @user122222 Ok, that's of course a limitation regarding the clients. You should still be able to create indexes on those columns to speed up those queries. I haven't used Dapper, but I'm sure you can do something similar, whether you need to convert it to an SQL query or otherwise. I'll take a look at a Dapper solution tomorrow if my example is not enough to lead you on – Xerillio Jun 02 '20 at 22:21
  • @user122222 Take a look at my edit, I hope this works in Dapper – Xerillio Jun 03 '20 at 19:54