C# 3.0: Need to return duplicates from a List<>

Question

I have a List<> of objects in C# and I need a way to return those objects that are considered duplicates within the list. I do not need the Distinct resultset, I need a list of those items that I will be deleting from my repository.

For the sake of this example, let's say I have a list of "Car" types and I need to know which of these cars are the same color as another in the list. Here are the cars in the list and their color property:

Car1.Color = Red;

Car2.Color = Blue;

Car3.Color = Green;

Car4.Color = Red;

Car5.Color = Red;

For this example I need the result (IEnumerable<>, List<>, or whatever) to contain Car4 and Car5 because I want to delete these from my repository or db so that I only have one car per color in my repository. Any help would be appreciated.

score 29 · Accepted Answer · edited Jan 29 '09 at 22:26

I inadvertently coded this yesterday, when I was trying to write a "distinct by a projection". I included a ! when I shouldn't have, but this time it's just right:

public static IEnumerable<TSource> DuplicatesBy<TSource, TKey>
    (this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    HashSet<TKey> seenKeys = new HashSet<TKey>();
    foreach (TSource element in source)
    {
        // Yield it if the key hasn't actually been added - i.e. it
        // was already in the set
        if (!seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}

You'd then call it with:

var duplicates = cars.DuplicatesBy(car => car.Color);
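To see what this returns for the question's data, here is a minimal self-contained sketch. The Car class (with string Name and Color properties) and the CarExtensions and DuplicateDemo class names are assumptions made for illustration; only DuplicatesBy comes from the answer itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Car type; the question does not show its definition.
public class Car
{
    public string Name { get; set; }
    public string Color { get; set; }
}

public static class CarExtensions
{
    // Same method as in the answer: yields every element whose key
    // has already been seen earlier in the sequence.
    public static IEnumerable<TSource> DuplicatesBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach (TSource element in source)
        {
            if (!seenKeys.Add(keySelector(element)))
            {
                yield return element;
            }
        }
    }
}

public static class DuplicateDemo
{
    // Builds the five cars from the question and returns the names
    // of the ones reported as duplicates by color.
    public static List<string> Run()
    {
        var cars = new List<Car>
        {
            new Car { Name = "Car1", Color = "Red" },
            new Car { Name = "Car2", Color = "Blue" },
            new Car { Name = "Car3", Color = "Green" },
            new Car { Name = "Car4", Color = "Red" },
            new Car { Name = "Car5", Color = "Red" }
        };

        return cars.DuplicatesBy(car => car.Color)
                   .Select(car => car.Name)
                   .ToList();
    }
}
```

With that data, DuplicateDemo.Run() returns Car4 and Car5: the first car of each color is kept and every later car of the same color is reported.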

edited Jan 29 '09 at 22:26

Joel Coehoorn


answered Jan 29 '09 at 22:21

Jon Skeet


Thanks a ton for this answer Jon. Helped me to optimize my ways of finding duplicates in a list. – Dienekes Dec 06 '10 at 10:54
Resharper tells me to refactor the foreach in your code with: return source.Where(element => !seenKeys.Add(keySelector(element))); – Koen Jan 13 '11 at 16:11
@Koen: Ick - I dare say it would work, but I don't like the idea of including a side-effect in a predicate. (That would also change the timing of when the hashset was created, but that's a minor matter.) – Jon Skeet Jan 13 '11 at 16:18
One other thing: all elements with the same key except the first (in loop order) are returned (so if you have 3 duplicates, 2 elements are returned), which seems like odd behavior to me. Either return all duplicates or return only the key once... – Koen Jan 13 '11 at 16:29
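For anyone who wants either of the two behaviors raised in that last comment, they could be sketched with GroupBy roughly as follows. AllDuplicatesBy and DuplicateKeys are hypothetical names, not part of the accepted answer, and unlike DuplicatesBy they have to read the whole source before yielding anything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateVariants
{
    // Hypothetical variant: returns every element whose key occurs more
    // than once, including the first occurrence.
    public static IEnumerable<TSource> AllDuplicatesBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        return source.GroupBy(keySelector)
                     .Where(group => group.Count() > 1)
                     .SelectMany(group => group);
    }

    // Hypothetical variant: returns each duplicated key exactly once.
    public static IEnumerable<TKey> DuplicateKeys<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        return source.GroupBy(keySelector)
                     .Where(group => group.Count() > 1)
                     .Select(group => group.Key);
    }
}
```

For the colors Red, Blue, Green, Red, Red, the first variant yields all three Red entries and the second yields Red once.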

Greg Beech · Answer 2 · 2009-01-29T22:34:48.507

var duplicates = from car in cars
                 group car by car.Color into grouped
                 from car in grouped.Skip(1)
                 select car;

This groups the cars by color and then skips the first result from each group, returning the remainder from each group flattened into a single sequence.

If you have particular requirements about which one you want to keep, e.g. if the car has an Id property and you want to keep the car with the lowest Id, then you could add some ordering in there, e.g.

var duplicates = from car in cars
                 group car by car.Color into grouped
                 from car in grouped.OrderBy(c => c.Id).Skip(1)
                 select car;
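As a rough check of the ordered version, here is a small sketch. The Car class with an int Id and a string Color, and the KeepLowestIdDemo class, are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Car type with the Id property the answer assumes.
public class Car
{
    public int Id { get; set; }
    public string Color { get; set; }
}

public static class KeepLowestIdDemo
{
    // Returns the Ids of the cars to delete, keeping the car with the
    // lowest Id in each color group.
    public static List<int> DuplicateIds(IEnumerable<Car> cars)
    {
        var duplicates = from car in cars
                         group car by car.Color into grouped
                         from car in grouped.OrderBy(c => c.Id).Skip(1)
                         select car;

        return duplicates.Select(car => car.Id).ToList();
    }
}
```

Given cars with (Id, Color) of (3, Red), (1, Red), (2, Blue) and (4, Red), DuplicateIds returns 3 and 4: the Red car with Id 1 is kept and the single Blue car is never reported.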

+1 - I didn't think of the Skip(1) because the asker only wanted the duplicates. — Matt Hamilton, Jan 29 '09 at 22:29

score 5 · Answer 3 · answered Jan 30 '09 at 14:51

Here's a slightly different Linq solution that I think makes it more obvious what you're trying to do:

var s = from car in cars
    group car by car.Color into g
    where g.Count() == 1
    select g.First();

It's just grouping cars by color, tossing out all the groups that have more than one element, and then putting the rest into the returned IEnumerable.
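Note that, as written, the Count() == 1 filter keeps only the cars whose color occurs once. If the goal is the duplicates the question asks for, the filter could be flipped along these lines. This is only a sketch: the Car class with string Name and Color properties and the GroupFilterDemo class are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Car type; the question does not show its definition.
public class Car
{
    public string Name { get; set; }
    public string Color { get; set; }
}

public static class GroupFilterDemo
{
    // The query from the answer: cars whose color appears exactly once.
    public static List<string> UniqueColorCars(IEnumerable<Car> cars)
    {
        var s = from car in cars
                group car by car.Color into g
                where g.Count() == 1
                select g.First().Name;
        return s.ToList();
    }

    // Flipped filter: for colors that appear more than once,
    // every car after the first in that group.
    public static List<string> DuplicateCars(IEnumerable<Car> cars)
    {
        var s = from car in cars
                group car by car.Color into g
                where g.Count() > 1
                from dupe in g.Skip(1)
                select dupe.Name;
        return s.ToList();
    }
}
```

For the question's five cars, UniqueColorCars gives Car2 and Car3, while DuplicateCars gives Car4 and Car5.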

Joel Coehoorn · Answer 4 · 2009-01-29T22:22:28.437

IEnumerable<Car> GetDuplicateColors(List<Car> cars)
{
    return cars.Where(c => cars.Any(c2 => c2.Color == c.Color && cars.IndexOf(c2) < cars.IndexOf(c) ) );
}

It basically means "return cars where there's any car in the list with the same color and a smaller index".

I'm not sure of the performance, though. Each IndexOf call is a linear scan inside a nested Any, so I suspect an approach with an O(1) lookup for duplicates (like the dictionary/hashset method) will be faster for large sets.
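The same "smaller index" idea can also be written with the indexed overload of Where, which avoids the IndexOf scans. This is a sketch, not a tuned implementation: it is still quadratic in the list length, and the Car class with string Name and Color properties is assumed for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Car type; the question does not show its definition.
public class Car
{
    public string Name { get; set; }
    public string Color { get; set; }
}

public static class IndexedDuplicates
{
    public static IEnumerable<Car> GetDuplicateColors(List<Car> cars)
    {
        // The indexed Where overload supplies each car's position, so the
        // cars before it can be inspected with Take(index) instead of IndexOf.
        return cars.Where((c, index) =>
            cars.Take(index).Any(earlier => earlier.Color == c.Color));
    }
}
```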

Ryan · Answer 5 · 2009-01-29T22:27:23.140

Create a new Dictionary<Color, Car> foundColors and a List<Car> carsToDelete

Then you iterate through your original list of cars like so:

foreach(Car c in listOfCars)
{
    if (foundColors.ContainsKey(c.Color))
    {
        carsToDelete.Add(c);
    }
    else
    {
        foundColors.Add(c.Color, c);
    }
}

Then you can delete every car that's in carsToDelete; foundColors holds the one car per color that you keep.

You could get a minor performance boost by putting your "delete record" logic in the if statement instead of creating a new list, but the way you worded the question suggested that you needed to collect them in a List.
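Put together as a compilable method, the loop might look like the sketch below. The Car class, the use of string for the color key (the question does not show the type of Color), and the DictionaryDuplicates and FindCarsToDelete names are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Car type; Color is assumed to be a string here.
public class Car
{
    public string Name { get; set; }
    public string Color { get; set; }
}

public static class DictionaryDuplicates
{
    public static List<Car> FindCarsToDelete(IEnumerable<Car> listOfCars)
    {
        // First car seen for each color; these are the ones to keep.
        var foundColors = new Dictionary<string, Car>();
        // Every later car whose color was already seen.
        var carsToDelete = new List<Car>();

        foreach (Car c in listOfCars)
        {
            if (foundColors.ContainsKey(c.Color))
            {
                carsToDelete.Add(c);
            }
            else
            {
                foundColors.Add(c.Color, c);
            }
        }

        return carsToDelete;
    }
}
```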

score 0 · Answer 6 · answered Jan 29 '09 at 22:12

Without actually coding it, how about an algorithm something like this:

iterate through your List<T> creating a Dictionary<T, int> that counts how many times each item occurs
iterate through your Dictionary<T, int> deleting entries where the int is 1

Anything left in the Dictionary has duplicates. The second part where you actually delete is optional, of course. You can just iterate through the Dictionary and look for the >1's to take action.
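A minimal sketch of that counting approach follows. The OccurrenceCounter class and its method names are hypothetical, and it assumes the items (or whatever key you count by, such as the car's color) have suitable equality semantics and are not null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OccurrenceCounter
{
    // First pass: count how many times each item occurs.
    public static Dictionary<T, int> CountOccurrences<T>(IEnumerable<T> items)
    {
        var counts = new Dictionary<T, int>();
        foreach (T item in items)
        {
            int current;
            // current is left at 0 when the item has not been seen yet.
            counts.TryGetValue(item, out current);
            counts[item] = current + 1;
        }
        return counts;
    }

    // Second pass: keep only the items that occurred more than once.
    public static List<T> ItemsWithDuplicates<T>(IEnumerable<T> items)
    {
        return CountOccurrences(items)
            .Where(pair => pair.Value > 1)
            .Select(pair => pair.Key)
            .ToList();
    }
}
```

For the colors Red, Blue, Green, Red, Red, the count for Red is 3 and ItemsWithDuplicates returns just Red.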

EDIT: OK, I bumped up Ryan's since he actually gave you code. ;)

EnocNRoll - AnandaGopal Pardue · Answer 7 · 2009-01-30T01:56:33.543

My answer takes inspiration (in this order) from the following respondents: Joel Coehoorn, Greg Beech and Jon Skeet.

I decided to provide a full example, with the assumption being (for real-world efficiency) that you have a static list of car colors. I believe the following code illustrates a complete solution to the problem in an elegant, although not necessarily hyper-efficient, manner.

#region SearchForNonDistinctMembersInAGenericListSample
public static string[] carColors = new[]{"Red", "Blue", "Green"}; 
public static string[] carStyles = new[]{"Compact", "Sedan", "SUV", "Mini-Van", "Jeep"}; 
public class Car
{
    public Car(){}
    public string Color { get; set; }
    public string Style { get; set; }
}
public static List<Car> SearchForNonDistinctMembersInAList()
{
    // pass in cars normally, but declare here for brevity
    var cars = new List<Car>(5) { new Car(){Color=carColors[0], Style=carStyles[0]}, 
                                      new Car(){Color=carColors[1],Style=carStyles[1]},
                                      new Car(){Color=carColors[0],Style=carStyles[2]}, 
                                      new Car(){Color=carColors[2],Style=carStyles[3]}, 
                                      new Car(){Color=carColors[0],Style=carStyles[4]}};
    List<Car> carDupes = new List<Car>();

    for (int i = 0; i < carColors.Length; i++)
    {
        Func<Car,bool> dupeMatcher = c => c.Color == carColors[i];

        int count = cars.Count<Car>(dupeMatcher);

        if (count > 1) // we have duplicates
        {
            foreach (Car dupe in cars.Where<Car>(dupeMatcher).Skip<Car>(1))
            {
                carDupes.Add(dupe);
            }
        }
    }
    return carDupes;
}
#endregion

I'm going to come back through here later and compare this solution to all three of its inspirations, just to contrast the styles. It's rather interesting.

score 0 · Answer 8 · answered Jan 30 '09 at 15:02

public static IQueryable<TSource> Duplicates<TSource>(this IEnumerable<TSource> source) where TSource : IComparable
{
    if (source == null)
        throw new ArgumentNullException("source");
    return source.Where(x => source.Count(y => y.Equals(x)) > 1).AsQueryable<TSource>();
}
