0

I want to remove duplicate records using Entity Framework.

This is what I've tried

var result = _context.History
            .GroupBy(s => new
                    {
                        s.Date,
                        s.EventId
                    })
            .SelectMany(grp => grp.Skip(1)).ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();

But I get an error

System.InvalidOperationException: Processing of the LINQ expression 'grp => grp.Skip(1)' by 'NavigationExpandingExpressionVisitor' failed. This may indicate either a bug or a limitation in EF Core

I understand that this is breaking change for Entity Framework, but I really don't know how to update my code.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Gerald Hughes
  • 5,771
  • 20
  • 73
  • 131
  • 1
    Don't use EF Core in the first place. EF Core is an ORM, not a SQL replacement. There are no Objects here, and the easiest and most efficient way to remove duplicates involves a CTE with `ROW_NUMBER()` that would return all multiples, ranked by whatever sort order you want, allowing you to select which row to keep – Panagiotis Kanavos Nov 22 '20 at 17:53
  • Eg `with dups as (select *, row_number() over (partition by date,eventid order by id desc) rn from...) delete dups where rn>1` will delete all duplicates except the largest id. The CTE doesn't need to return all columns, just the key columns are enough. You can specify a different `ORDER BY` to select different rows to preserve – Panagiotis Kanavos Nov 22 '20 at 17:56
  • 1
    @PanagiotisKanavos Is this CTE database agnostic or just SqlServer specific? ORM might not be a SQL replacement, but LINQ is supposed to be abstraction and database agnostic language integrated query language, so why don't use it? The fact that EF Core breaks the contract by not willing to translate it doesn't mean OP is doing something wrong. – Ivan Stoev Nov 22 '20 at 18:02
  • @IvanStoev the operation isn't object-agnostic. There are no objects involved. LINQ wasn't meant to handle such situations. ORMs were never meant for reporting queries or fully replacing SQL. If you replace `database agnostic` with `ANSI standard`, yes, it's ANSI standard and even supported in MySQL after MySQL 8. All other major databases had `ROW_NUMBER()` already – Panagiotis Kanavos Nov 22 '20 at 18:04
  • @IvanStoev and SQLite added windowing functions in [version 3.25](https://www.sqlite.org/releaselog/3_25_0.html). Besides, what the OP is trying to do doesn't make sense in SQL - that group isn't really grouping and there's no SKIP in SQL. This is trying to apply (somewhat inefficient) LINQ-to-Objects operation to a database hoping that EF Core can somehow translate this to SQL – Panagiotis Kanavos Nov 22 '20 at 18:07
  • @PanagiotisKanavos How about NoSQL databases? And why is LINQ not meant to handle *such* situations? What's the purpose of LINQ to `IQueryable` then? And if this is so "standard" functionality, ORM is obliged to provide abstract access to it, for instance with custom extensions similar to linq2db. People using ORM are not supposed to know SQL - period. – Ivan Stoev Nov 22 '20 at 18:22
  • @IvanStoev no it's not, because an ORM is an Object-to-Relational Mapper, not a SQL replacement. It's only meant to Map Objects to Relational constructs. LINQ goes beyond that, but only up to a point. Other ORMs, especially micro-ORMs, or ORMs in other languages (eg GORM, Rails, even NHibernate) don't even go that far – Panagiotis Kanavos Nov 22 '20 at 18:24
  • @IvanStoev EF Core and most ORMs can't even handle DELETEs unless they refer to specific IDs. The attempt to make ORMs do everything a database does is what Ted Neward described as [The Vietnam of Computer Science](http://blogs.tedneward.com/post/the-vietnam-of-computer-science/) as far back as 2006. Rails created the illusion that you don't need to know SQL - mainly by hiding the features it couldn't handle. In this particular case, raw SQL or through a microORM like Dapper to allow easy parameterization would be far better than trying to emulate SQL through the ORM's query language – Panagiotis Kanavos Nov 22 '20 at 18:25
  • @PanagiotisKanavos *"This is trying to apply (somewhat inefficient) LINQ-to-Objects operation to a database hoping that EF Core can somehow translate this to SQL"* The same is with SQL `left outer join` - LINQ has no such operator, but at least two patterns are recognized and translated correctly. This Top N per group pattern is soo common, EF Core is obliged to recognize it and translate to your favorite `row_number() over ` pattern or whatever is supported by the underlying database. – Ivan Stoev Nov 22 '20 at 18:32
  • The open issue is here - [Support ability to select top N of each group #13805](https://github.com/dotnet/efcore/issues/13805). Go and vote for it. – Ivan Stoev Nov 22 '20 at 18:42

2 Answers2

1

Looks like Entity Framework doesn't know how to translate this Skip part of LINQ query. Moreover, it cannot make translate this GroupBy part. In EF Core 3 it will throw an exception to let us know :)

So, a dirty but simple way is to add AsEnumerable almost at the beginning, however, it will fetch all table and perform operations in memory:

var result = _context.History
            .AsEnumerable()
            .GroupBy(s => new { s.Date, s.EventId })
            .SelectMany(g => g.Skip(1))
            .ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();

Since in most cases it's not acceptable to fetch everything we can split first request into two so that we download only duplicated records.

Second answer of this question might help, we can try something like this:

var keys = _context.History
                .GroupBy(s => new { s.Date, s.EventId })
                .Select(g => new { g.Key, Count = g.Count() })
                .Where(t => t.Count > 1)
                .Select(t => new { t.Key.Date, t.Key.EventId })
                .ToList();

var result = _context.History
    .Where(h => keys.Any(k => k.Date == h.Date && k.EventId == h.EventId))
    .AsEnumerable()
    .GroupBy(s => new { s.Date, s.EventId })
    .SelectMany(g => g.Skip(1))
    .ToList();

_context.History.RemoveRange(result);
await _context.SaveChangesAsync();
ochzhen
  • 156
  • 2
  • 5
  • Hello, welcome to SO. This is the error I get `System.InvalidOperationException: Client side GroupBy is not supported.` – Gerald Hughes Nov 23 '20 at 07:24
  • Interesting, I didn't know it but looks like in EF Core 3 they added an explicit error since `GroupBy` is not being translated to SQL, second answer here is quite good: https://stackoverflow.com/questions/58138556/client-side-groupby-is-not-supported The easiest solution is to move `AsEnumerable()` to the top right after `_context.History`. However, it will fetch all data from this table to the server and perform everything in memory. Is it acceptable in your case? – ochzhen Nov 23 '20 at 16:58
  • I've updated the answer so that it's easier for you to understand my previous comment. It might help :) – ochzhen Nov 23 '20 at 17:34
1

In this case you are grouping by both columns:

var duplicate = DB.History.GroupBy(x => new { x.Date, x.EventId})
                         .Where(x => x.Count() > 1)
                         .SelectMany(x => x.ToList());
Milos Gak
  • 21
  • 1
  • 4