Selecting distinct rows in Entity Framework

Question

I'm trying to query an SQL Server view with Entity Framework and only return rows that are distinct based on multiple columns.

I've tried a solution based on this answer (GroupBy and then Select(g => g.FirstOrDefault()), but I'm still getting duplicate rows.

Table structure (this is a fairly complex view in the real database, but the final output is similar in structure to this example):

CREATE TABLE Example (
    ID_A BIGINT,
    ID_B BIGINT,
    ID_C BIGINT,

    Type_A NVARCHAR(50),
    Type_B NVARCHAR(50),

    ID_Z BIGINT,

    Foo NVARCHAR(200),
    Bar NVARCHAR(200)
)

Example data:

INSERT INTO Example (ID_A, ID_B, ID_C, Type_A, Type_B,  ID_Z,  Foo, Bar)
VALUES (1, 1, 1, 'TypeA1', 'TypeB1',  1,  'foo1', 'bar1'), -- This row and the next one represent the same main record (1) joined to different "Z" records (1 and 2)
       (1, 1, 1, 'TypeA1', 'TypeB1',  2,  'foo1', 'bar1'),
       (2, 1, 2, 'TypeA2', 'TypeA2',  1,  'foo2', 'bar2'), -- This row and the next two represent the same main record (2) joined to different "Z" records (1, 2 and 3)
       (2, 1, 2, 'TypeA2', 'TypeA2',  2,  'foo2', 'bar2'),
       (2, 1, 2, 'TypeA2', 'TypeA2',  3,  'foo2', 'bar2')

Entity class:

public class ExampleEntity
{
    [Key]
    public long ID_A { get; set; }
    public long ID_B { get; set; }
    public long ID_C { get; set; }

    public string Type_A { get; set; }
    public string Type_B { get; set; }

    public long? ID_Z { get; set; }

    public string Foo { get; set; }
    public string Bar { get; set; }

Entity configuration:

public class ExampleEntityConfiguration : EntityTypeConfiguration<ExampleEntity>
{
    public ExampleEntityConfiguration()
    {
        // Properties
        this.Property(t => t.ID_A)
            .HasDatabaseGeneratedOption(DatabaseGeneratedOption.None);

        // Table & Column Mappings
        this.ToTable("Example");
        this.Property(t => t.ID_A).HasColumnName("ID_A");
        this.Property(t => t.ID_B).HasColumnName("ID_B");
        this.Property(t => t.ID_C).HasColumnName("ID_C");

        this.Property(t => t.Type_A).HasColumnName("Type_A");
        this.Property(t => t.Type_B).HasColumnName("Type_B");

        this.Property(t => t.ID_Z).HasColumnName("ID_Z");

        this.Property(t => t.Foo).HasColumnName("Foo");
        this.Property(t => t.Bar).HasColumnName("Bar");
    }
}

ID_A, ID_B, ID_C, Type_A and Type_B identify the "main" entity and ID_Z identifies the joined "Z" entity. Foo and Bar are non-unique data columns that need to be included in the final results.

For each combination of main ID/Type values, there can be multiple ID_Z values. I need to filter by the ID_Z values and then return the distinct main entity values (based on the ID/Type values).

I've tried a query like the following:

// The `ID_Z` values to filter on
var zIDs = new List<long> { 1, 2 };

var result = context.Set<ExampleEntity>()
       .Where(e => zIDs.Contains(e.ID_Z))
       .GroupBy(e => new { e.ID_A, e.ID_B, e.ID_C, e.Type_A, e.Type_B })
       .Select(g => g.FirstOrDefault())
       .ToList();

Which results in this SQL:

SELECT 
    [Limit1].[ID_A] AS [ID_A], 
    [Limit1].[ID_B] AS [ID_B], 
    [Limit1].[ID_C] AS [ID_C], 
    [Limit1].[Type_A] AS [Type_A], 
    [Limit1].[Type_B] AS [Type_B], 
    [Limit1].[ID_Z] AS [ID_Z], 
    [Limit1].[Foo] AS [Foo], 
    [Limit1].[Bar] AS [Bar]
FROM   (
    SELECT [Extent1].[ID_A] AS [ID_A], [Extent1].[ID_B] AS [ID_B], [Extent1].[ID_C] AS [ID_C], [Extent1].[Type_A] AS [Type_A], [Extent1].[Type_B] AS [Type_B]
    FROM [dbo].[Example] AS [Extent1] WITH (NOLOCK)
    WHERE 
        ([Extent1].[ID_Z] IN (cast(1 as bigint), cast(2 as bigint))) AND ([Extent1].[ID_Z] IS NOT NULL)
) AS [Filter1]
OUTER APPLY  (
    SELECT TOP (1) 
        [Extent2].[ID_A] AS [ID_A], 
        [Extent2].[ID_B] AS [ID_B], 
        [Extent2].[ID_C] AS [ID_C], 
        [Extent2].[Type_A] AS [Type_A], 
        [Extent2].[Type_B] AS [Type_B], 
        [Extent2].[ID_Z] AS [ID_Z], 
        [Extent2].[Foo] AS [Foo], 
        [Extent2].[Bar] AS [Bar]
     FROM [dbo].[Example] AS [Extent2] WITH (NOLOCK)
     WHERE 
        ([Extent2].[ID_Z] IN (cast(1 as bigint), cast(2 as bigint))) AND ([Extent2].[ID_Z] IS NOT NULL) AND 
        ([Filter1].[ID_A] = [Extent2].[ID_A]) AND 
        ([Filter1].[ID_B] = [Extent2].[ID_B]) AND 
        ([Filter1].[ID_C] = [Extent2].[ID_C]) AND 
        (([Filter1].[Type_A] = [Extent2].[Type_A]) OR (([Filter1].[Type_A] IS NULL) AND ([Extent2].[Type_A] IS NULL))) AND 
        (([Filter1].[Type_B] = [Extent2].[Type_B]) OR (([Filter1].[Type_B] IS NULL) AND ([Extent2].[Type_B] IS NULL))) 
) AS [Limit1]

But this seems to return all of the rows matching the Z_ID filter (resulting in duplicate "main" values) instead of only returning the first row for each set of "main" ID/Type values.

If I materialise (ToList) the query directly after GroupBy, I seem to get the correct groupings; but I'd like to run this all in the DB and avoid using LINQ to Objects queries.

How can I create this query?

What version of EF are you using? Because I'm fairly certain that in EF Core 3.0 `.Distinct()` gets correctly translated to SQL. — V0ldek, Oct 09 '19 at 07:04
If I'm not mistaken, `Distinct()` would operate on every column; but I only want the distinct check on a subset of the columns. — user10998602, Oct 09 '19 at 07:07
Can we see the `ExampleEntity` class and configuration, especially the "primary key". — Ivan Stoev, Oct 09 '19 at 09:25
@IvanStoev `ExampleEntity` represents a view consisting of unions between multiple main entity tables joined to their related tables. There's no single primary key, the closest thing is the five ID/Type columns that I mention in the question. — user10998602, Oct 09 '19 at 13:02
I understand that, that's why asking - EF6 requires a primary key, so you should have been forced to configure some fake composite primary key, and I wanted to see that. Because it might affect the generated SQL, hence the result of the query. Also can you post the generated SQL? — Ivan Stoev, Oct 09 '19 at 13:25
I'll check what the primary key is configured as and see if I can post the generated SQL tomorrow, though the code and database structure are commercial/private so I'll probably only be able to post an outline. — user10998602, Oct 09 '19 at 13:43
@IvanStoev I've updated the question with the entity class, the entity configuration and the generated SQL. — user10998602, Oct 10 '19 at 04:21
As I was expecting, the problem is caused by the PK definition of the entity. By marking `ID_A` as PK essentially you are telling EF that this field is unique, hence it will assume any combination of fields including that field as unique, and will remove the normal `GroupBy` / `Distinct` operator. You really need to define a real unique key as PK for that entity. Eventually including all fields, but that doesn't work with nullable fields. Better extend the database view to include `ROW_NUMBER` or `NEWID` based calculated column and map it as PK. — Ivan Stoev, Oct 10 '19 at 15:58
@IvanStoev That was the issue, thanks. Do you want to submit that as an answer, or should I? — user10998602, Oct 11 '19 at 01:06

score 1 · Accepted Answer · answered Oct 15 '19 at 03:20

(Because apparently this was never done, I'm copying Ivan Stoev's comment below as "the answer")

As I was expecting, the problem is caused by the PK definition of the entity. By marking ID_A as PK essentially you are telling EF that this field is unique, hence it will assume any combination of fields including that field as unique, and will remove the normal GroupBy / Distinct operator. You really need to define a real unique key as PK for that entity. Eventually including all fields, but that doesn't work with nullable fields. Better extend the database view to include ROW_NUMBER or NEWID based calculated column and map it as PK. – Ivan Stoev Oct 10 at 15:58

Selecting distinct rows in Entity Framework

1 Answers1