
How should I implement IEqualityComparer<DataRow> to remove duplicate rows from a DataTable with the following structure:

ID primary key, col_1, col_2, col_3, col_4

The default comparer doesn't work because each row has its own unique primary key.

How can I implement an IEqualityComparer<DataRow> that skips the primary key and compares only the remaining data?

I have something like this:

public class DataRowComparer : IEqualityComparer<DataRow>
{
 public bool Equals(DataRow x, DataRow y)
 {
  return
   x.ItemArray.Except(new object[] { x[x.Table.PrimaryKey[0].ColumnName] }) ==
   y.ItemArray.Except(new object[] { y[y.Table.PrimaryKey[0].ColumnName] });
 }

 public int GetHashCode(DataRow obj)
 {
  return obj.ToString().GetHashCode();
 }
}

and

public static DataTable RemoveDuplicates(this DataTable table)
{
  return
    (table.Rows.Count > 0) ?
  table.AsEnumerable().Distinct(new DataRowComparer()).CopyToDataTable() :
  table;
}

But it only calls GetHashCode() and never calls Equals().

Thomas Levesque
abatishchev

1 Answer


That is the way Distinct works. Internally it calls GetHashCode first, and it calls Equals only for items whose hash codes match. You can write GetHashCode to do what you need. Something like:

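public int GetHashCode(DataRow obj)
{
    // Hash every value except the primary key, so rows that differ only by ID hash alike.
    // Note: Except is a set operation, so it also drops any other column value equal
    // to the key value and collapses repeated values within the row.
    var values = obj.ItemArray.Except(new object[] { obj[obj.Table.PrimaryKey[0].ColumnName] });
    int hash = 0;
    foreach (var value in values)
    {
        // Combine the per-column hashes; 397 is just a moderately large prime.
        hash = (hash * 397) ^ value.GetHashCode();
    }
    return hash;
}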

Since you know your data better, you can probably come up with a better way to generate the hash.
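For completeness, here is a minimal sketch of a whole comparer in which Equals and GetHashCode look at the same values. It is one possible approach, not the only one. The class names NonKeyDataRowComparer and Demo are hypothetical, the seed 17 is arbitrary, and the sketch assumes both rows come from tables with the same column layout. Note that the Equals in the question compares two Except sequences with ==, which is a reference comparison and so is false for two separately created sequences; SequenceEqual compares the values instead.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical name: compares rows column by column, skipping primary-key columns.
public class NonKeyDataRowComparer : IEqualityComparer<DataRow>
{
    // Values of every column that is not part of the primary key, in column order.
    private static IEnumerable<object> NonKeyValues(DataRow row)
    {
        DataColumn[] key = row.Table.PrimaryKey;
        return row.Table.Columns.Cast<DataColumn>()
                  .Where(c => !key.Contains(c))
                  .Select(c => row[c]);
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Element-by-element value comparison (uses object.Equals on each pair).
        return NonKeyValues(x).SequenceEqual(NonKeyValues(y));
    }

    public int GetHashCode(DataRow obj)
    {
        unchecked // overflow is expected and harmless when mixing hash codes
        {
            int hash = 17;
            foreach (object value in NonKeyValues(obj))
                hash = (hash * 397) ^ value.GetHashCode(); // DBNull.Value is never a null reference
            return hash;
        }
    }
}

public static class Demo
{
    // Builds a small table with two rows that differ only by ID and counts distinct rows.
    public static int DistinctRowCount()
    {
        var table = new DataTable();
        DataColumn id = table.Columns.Add("ID", typeof(int));
        table.Columns.Add("col_1", typeof(string));
        table.Columns.Add("col_2", typeof(int));
        table.PrimaryKey = new[] { id };

        table.Rows.Add(1, "a", 10);
        table.Rows.Add(2, "a", 10); // same data as the first row, different key
        table.Rows.Add(3, "b", 10);

        return table.AsEnumerable().Distinct(new NonKeyDataRowComparer()).Count();
    }
}
```

With this comparer, the RemoveDuplicates extension method from the question can stay as it is apart from the comparer's name; the Rows.Count check there is still worth keeping, because CopyToDataTable throws on an empty sequence.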

abatishchev
Mike Two
  • It's always a good idea to keep your Equals and GetHashCode functions in sync, i.e. Equals should never return true when the hash codes are not identical. By the way, my guess is that Equals() will still be called when GetHashCode() returns the same value (since hashes can collide), so you could maybe cheat and always return a dummy hash. But don't do it. – HerdplattenToni Oct 21 '09 at 08:53
  • This isn't just "a good idea", it's a recommended practice. MSDN says "Types that override Equals must also override GetHashCode." – Thomas Levesque Oct 21 '09 at 08:57
  • Why exactly 397? How about using the primary key if it's an INT? – abatishchev Oct 21 '09 at 09:17
  • @abatishchev - 397 is just what ReSharper picks. Any prime number that isn't tiny will do. I left the primary key out because the point is to find duplication in the non-primary-key columns. – Mike Two Oct 21 '09 at 09:31
  • You explicitly DON'T want to use the primary key, because you want the same hash code when the rows are duplicates (in your sense). @Thomas: yeah, I hope that most "always good idea" things are a recommended practice. Or will become one some day. – HerdplattenToni Oct 21 '09 at 09:32
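
The guess in the first comment (that Equals is still called when hash codes are the same) can be checked with a small experiment. The sketch below uses strings instead of DataRow to keep it short. CountingComparer and HashDemo are hypothetical names made up for this illustration. The comparer counts Equals calls, and a flag switches between the real hash and a constant dummy hash.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that counts how often Distinct calls Equals.
public class CountingComparer : IEqualityComparer<string>
{
    private readonly bool constantHash;
    public int EqualsCalls;

    public CountingComparer(bool constantHash)
    {
        this.constantHash = constantHash;
    }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y);
    }

    // With constantHash set, every item gets the same dummy hash code.
    public int GetHashCode(string obj)
    {
        return constantHash ? 0 : obj.GetHashCode();
    }
}

public static class HashDemo
{
    // Runs Distinct over four different strings and reports the number of Equals calls.
    public static int EqualsCallsFor(bool constantHash)
    {
        var comparer = new CountingComparer(constantHash);
        var items = new[] { "a", "b", "c", "d" };
        items.Distinct(comparer).ToList(); // ToList forces the lazy query to run
        return comparer.EqualsCalls;
    }
}
```

With the dummy hash, every new item has to be compared against the items already seen, so Equals is called repeatedly and the lookup degrades from roughly constant time to a linear scan. With the real hash, Equals is normally not called at all for items that are all different. That is why the dummy hash works but is a bad idea on a large table.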