
Similar to this question, I'm trying to iterate over only the distinct values of a sub-string of the given strings, for example:

List<string> keys = new List<string>()
{
    "foo_boo_1",
    "foo_boo_2",
    "foo_boo_3",
    "boo_boo_1"
};

The output should contain one key per distinct sub-string (arbitrarily, the first key that has that sub-string):

foo_boo_1 (the first one)
boo_boo_1

I've tried to implement this solution using the IEqualityComparer with:

public class MyEqualityComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {            
        int xIndex = x.LastIndexOf("_"); 
        int yIndex = y.LastIndexOf("_");
        if (xIndex > 0 && yIndex > 0)
            return x.Substring(0, xIndex) == y.Substring(0, yIndex);
        else
            return false;
    }

    public int GetHashCode(string obj)
    {
        return obj.GetHashCode();
    }
}

foreach (var key in myList.Distinct(new MyEqualityComparer()))
{
    Console.WriteLine(key);
}

But the resulting output is:

foo_boo_1
foo_boo_2
foo_boo_3
boo_boo_1

Using the IEqualityComparer, how do I remove the keys whose sub-string is a duplicate (foo_boo_2 and foo_boo_3)?

*Please note that the "real" keys are a lot longer, something like "1_0_8-B153_GF_6_2", so I must use LastIndexOf.

Shahar Shokrani

3 Answers


The GetHashCode method in your equality comparer returns the hash code of the entire string. Make it hash only the substring, for example:

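public int GetHashCode(string obj)
{
    // Hash only the part before the last underscore, which is what Equals compares.
    var index = obj.LastIndexOf("_");
    // Without a usable underscore, fall back to the whole string so Substring cannot throw.
    return index > 0 ? obj.Substring(0, index).GetHashCode() : obj.GetHashCode();
}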
DavidG

Your current implementation has some flaws:

  1. Equals and GetHashCode must never throw an exception (you have to check for null).
  2. If Equals returns true for x and y, then GetHashCode(x) == GetHashCode(y) must hold. A counterexample is "abc_1" and "abc_2".

The second flaw can easily cause Distinct to return incorrect results (Distinct computes the hash first).
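
To make the second flaw concrete, here is a minimal sketch that checks the contract for a single pair of strings. OriginalComparer is a copy of the comparer from the question; ContractCheck and its Holds method are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Copy of the comparer from the question: Equals compares the prefixes,
// but GetHashCode hashes the whole string.
public class OriginalComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        int xIndex = x.LastIndexOf("_");
        int yIndex = y.LastIndexOf("_");
        if (xIndex > 0 && yIndex > 0)
            return x.Substring(0, xIndex) == y.Substring(0, yIndex);
        return false;
    }

    public int GetHashCode(string obj)
    {
        return obj.GetHashCode();
    }
}

// Hypothetical helper: checks "Equals(x, y) implies equal hash codes" for one pair.
public static class ContractCheck
{
    public static bool Holds(IEqualityComparer<string> comparer, string x, string y)
    {
        if (!comparer.Equals(x, y))
            return true; // the contract demands nothing of unequal items

        return comparer.GetHashCode(x) == comparer.GetHashCode(y);
    }
}
```

Calling ContractCheck.Holds(new OriginalComparer(), "abc_1", "abc_2") should return false: Equals reports the pair as equal, while the hash codes of the two full strings almost certainly differ.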

Correct code can look something like this:

public class MyEqualityComparer : IEqualityComparer<string> {
  public bool Equals(string x, string y) {            
    if (ReferenceEquals(x, y))
      return true;
    else if ((null == x) || (null == y))
      return false;

    int xIndex = x.LastIndexOf('_'); 
    int yIndex = y.LastIndexOf('_');

    if (xIndex >= 0)         
      return (yIndex >= 0) 
        ? x.Substring(0, xIndex) == y.Substring(0, yIndex)
        : false;
    else if (yIndex >= 0)         
      return false;
    else
      return x == y; 
  }

  public int GetHashCode(string obj) {
    if (null == obj)  
      return 0;

    int index = obj.LastIndexOf('_');

    return index < 0 
      ? obj.GetHashCode() 
      : obj.Substring(0, index).GetHashCode();
  }
}

Now you are ready to use it with Distinct:

   foreach (var key in myList.Distinct(new MyEqualityComparer())) {
     Console.WriteLine(key);
   }
Dmitry Bychenko
  • Hey @dmitry, great answer. can you please explain why the breakpoint does not break inside the `Equals(...)`? – Shahar Shokrani Mar 12 '20 at 10:53
  • @Shahar Shokrani: `Distinct` compares `x` and `y` in two stages: first it compares the hash codes; if they differ there's no need to call `Equals`, and only if the hash codes are equal does it run `Equals`. Our hashes are good, so it may well turn out that we don't need `Equals` at all – Dmitry Bychenko Mar 12 '20 at 11:00
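
The two-stage comparison described in the comment above can be observed with a comparer that counts its own calls. This is only a sketch: CountingComparer and CountingDemo are hypothetical names, the keys are assumed to be non-null, and the exact number of Equals calls depends on the LINQ implementation and on hash collisions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that counts how often Equals and GetHashCode run.
// It does not check for null, so it is for demonstration only.
public class CountingComparer : IEqualityComparer<string>
{
    private readonly Func<string, string> _hashKey;

    public int EqualsCalls { get; private set; }
    public int HashCalls { get; private set; }

    // hashKey selects the part of the string that gets hashed.
    public CountingComparer(Func<string, string> hashKey)
    {
        _hashKey = hashKey;
    }

    // Everything before the last underscore, or the whole string if there is none.
    public static string Prefix(string s)
    {
        int index = s.LastIndexOf('_');
        return index < 0 ? s : s.Substring(0, index);
    }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return Prefix(x) == Prefix(y);
    }

    public int GetHashCode(string obj)
    {
        HashCalls++;
        return _hashKey(obj).GetHashCode();
    }
}

public static class CountingDemo
{
    // Runs Distinct over the sample keys; returns the number of distinct
    // items and how many times Equals was called along the way.
    public static (int Distinct, int EqualsCalls) Run(Func<string, string> hashKey)
    {
        var keys = new List<string> { "foo_boo_1", "foo_boo_2", "foo_boo_3", "boo_boo_1" };
        var comparer = new CountingComparer(hashKey);
        int distinct = keys.Distinct(comparer).Count();
        return (distinct, comparer.EqualsCalls);
    }
}
```

CountingDemo.Run(CountingComparer.Prefix) hashes the prefix, so the three foo_boo keys share a hash code and Equals is consulted for the later two. CountingDemo.Run(s => s) hashes the whole string, as in the question: the hash codes almost always differ, Equals is normally never reached, and all four keys are kept.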

For a more succinct solution that avoids using a custom IEqualityComparer<>, you could utilise GroupBy. For example:

var keys = new List<string>()
{
    "foo_boo_1",
    "foo_boo_2",
    "foo_boo_3",
    "boo_boo_1"
};

var distinct = keys
    .Select(k => new
    {
        original = k,
        truncated = k.Contains("_") ? k.Substring(0, k.LastIndexOf("_")) : k
    })
    .GroupBy(k => k.truncated)
    .Select(g => g.First().original);

Iterating over distinct yields:

foo_boo_1
boo_boo_1
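
On .NET 6 or later, the same idea can be written with Enumerable.DistinctBy, which takes the key selector directly. This is a sketch under that version assumption; DistinctByDemo, Prefix and FirstPerPrefix are hypothetical names. In practice the first key seen for each prefix is kept, but the documentation describes the result as unordered, so treat the ordering as an implementation detail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctByDemo
{
    // Everything before the last underscore, or the whole key if there is none.
    public static string Prefix(string key)
    {
        int index = key.LastIndexOf('_');
        return index < 0 ? key : key.Substring(0, index);
    }

    // Keeps one key per distinct prefix. DistinctBy requires .NET 6 or later.
    public static List<string> FirstPerPrefix(IEnumerable<string> keys)
    {
        return keys.DistinctBy(Prefix).ToList();
    }
}
```

For example, DistinctByDemo.FirstPerPrefix(keys) with the four sample keys returns one foo_boo key and boo_boo_1.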

Oliver