I only want new value to be deleted (and old value to stay)
Since you already know how to identify the transaction IDs you want to delete you could delete the necessary rows while keeping the latest like so (you didn't mention it but I'm assuming you're using Entity Framework - given your use of the [Key]
attribute - correct me if I'm wrong):
var transToRemove = dbContext.Transactions
.Where(t => t.ClientId == clientId && transIds.Contains(t.TransactionId))
.GroupBy(t => t.TransactionId, t => t) // Group transactions with the same TransactionId
.SelectMany(
group => group.OrderBy(t => t.Date) // Order the oldest first
.Skip(1) // Skip the oldest (we want to keep it)
);
dbContext.Transactions.RemoveRange(transToRemove);
dbContext.SaveChanges();
Edit: Included an example that should work for Dapper...
var cn = // Create your DbConnection
// This query should select all transactions you want to delete excluding
// those with the oldest Date. This is just like 'transToRemove' above
var selectQuery = @"
SELECT t1.Id FROM Transactions t1
INNER JOIN (
SELECT
MIN(tInner.Date) AS FirstTransDate,
tInner.TransactionId,
tInner.ClientId
FROM Transactions tInner
WHERE tInner.ClientId = @clientId
AND tInner.TransactionId IN @transIds
GROUP BY tInner.TransactionId, tInner.ClientId
) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
WHERE t1.Date != t2.FirstTransDate
";
var idsToDelete = cn.Query<int>(
selectQuery,
new { clientId, transIds }).ToList();
// Delete the whole list in one go
cn.Execute("DELETE FROM Transactions WHERE Id in @idsToDelete", new {idsToDelete});
(inspiration from here and here)
I haven't tested this using Dapper but the list of idsToDelete
should be correct according to this fiddle I made. A couple things to note:
- Depending on how long your list of
transIds
is (I believe those ID's are in result
in your own example) you might want to repeat this in smaller batches instead of trying to delete the whole list in one go.
- The SQL query above doesn't take into account if two "duplicate" transactions have the same "oldest" Date. If that can happen in your table, then this query will only remove all "duplicate" rows apart from those two.
Improvements
There are a couple of things that seem a little out of place with your setup that I think you should consider:
even fetching var result = await _transactionsDataRepository...
can take a lot of time since there are millions of rows
Millions of rows should not be an issue for any decent database server to handle. It sounds like you are missing some indexes on your table. With proper indexes your queries should be pretty swift as long as you can keep them simple.
but would that be a good way?
Not quite sure what you are referring to being good or bad here, but I'll interpret a little... Right now you are writing tons of rows to the table that seems to contain duplicate data. When I think of a transaction-based system, no two transactions should share the same ID. That means for two different ClientId
s there should never be a case where t1.TransactionId == t2.TransactionId
. Now you can avoid checking ClientId
in my code-snippet above.
Since you want to only keep 1 transaction for each TransactionId
will you ever need to have two transactions with the same TransactionId
? If not, then you can go even further and make the TransactionId
column unique and avoid inserting two rows with the same TransactionId
. You can use the Entity Framework [Index(IsUnique=true)]
attribute to also create an index to speed up queries on that column/property.