-1

I'm looking for another solution, preferably with LINQ, for:

List<int> distinctAges = new List<int>();
for (int indexInt = 0; indexInt < ages.Count; indexInt++)
    if (!distinctAges.Contains(ages[indexInt]))
        distinctAges.Add(ages[indexInt]);
// 69.98 ms

vs

List<int> distinctAges = new List<int>();
foreach(int singleAge in ages)
    if (!distinctAges.Contains(singleAge))
        distinctAges.Add(singleAge);
// 94.89 ms

another solution:

List<int> distinctAges = ages.GroupBy(singleNumber => singleNumber).Select(singleGroup => singleGroup.Key).ToList();
// 293.22 ms

or:

List<int> distinctAges = ages.Distinct().ToList();
// 103.16 ms

The times shown are measured from an outer loop, and the result must be a List. I'm looking for a solution that does not use for/foreach and has an execution time similar to for/foreach. Any ideas?

Kopco
  • 9
  • 1
  • 13
    *Why* do you need a solution besides for/foreach? They do what you want, and have the execution time you're looking for. – Broots Waymb Sep 21 '20 at 12:55
  • Yes, I know; that works for me, but I would like to learn a different way of doing it. – Kopco Sep 21 '20 at 12:58
  • 8
    Have you tried `var distinctAges = new HashSet<int>(ages);` or `var distinctAges = ages.ToHashSet();`? You don't need a list, do you? – Aluan Haddad Sep 21 '20 at 12:58
  • 4
    Honestly I'd stick with your `ages.Distinct().ToList()` attempt. It's not slow, and the intent is very clear. – Code Stranger Sep 21 '20 at 13:01
  • 2
    You have a collection of items and you need to look at each item to determine which ones are unique. The only way to do that is by iterating over every item. Even the LINQ solutions are iterating over them under the covers. The only other thing you could do is just get the `IEnumerator` and do a `while` loop. – juharr Sep 21 '20 at 13:01
  • Thanks, that HashSet sounds nice and it's different. So is it enough to just use ages.ToHashSet();? – Kopco Sep 22 '20 at 08:00
  • 1
    Based on @AluanHaddad and my tests, the range of values versus count of `ages` makes a significant difference - if there are lots of collisions (duplicate values) in `ages`, then `for`-`Contains`-`HashSet.Add`-`ToList` is significantly faster, but if the values are mostly distinct, then `ToHashSet().ToList()` is fastest. – NetMage Sep 24 '20 at 18:18

3 Answers

1

In general, LINQ is not faster than explicit looping, particularly indexed looping with for, which the C# compiler can optimize.

My timings differ from yours: with 10 million ages in a list, LINQ Distinct is faster than your for and foreach loops, because List<T>.Contains is not as fast as HashSet<T>.Contains, which Distinct is based on.

If you can use Parallel, then AsParallel().Distinct() is often faster, and not significantly slower than HashSet.

So, fastest non-parallel:

var hs = new HashSet<int>();
for (int j1 = 0; j1 < ages.Count; ++j1) {
    if (!hs.Contains(ages[j1]))
        hs.Add(ages[j1]);
}
var ans = hs.ToList();

Note: Testing with Contains on the HashSet<T> before calling Add is marginally faster than calling Add alone, though I don't think it should be. Seems like there may be an optimization possible there.
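
To make that note concrete, here is a minimal sketch of the two variants being compared. The class and method names (`DistinctDemo`, `WithContains`, `AddOnly`, `TimeMs`) are made up for illustration, and the input size and seed are assumptions, not the setup used for the timings above. Both variants return the same set of values, because `HashSet<T>.Add` simply returns false for a duplicate; only the timing may differ, and a single `Stopwatch` run like this is a rough indication at best.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class DistinctDemo
{
    // Variant from the answer: check with Contains, then Add.
    public static List<int> WithContains(List<int> ages)
    {
        var seen = new HashSet<int>();
        for (int i = 0; i < ages.Count; ++i)
        {
            if (!seen.Contains(ages[i]))
                seen.Add(ages[i]);
        }
        return seen.ToList();
    }

    // Variant without the check: Add returns false for a duplicate
    // and leaves the set unchanged.
    public static List<int> AddOnly(List<int> ages)
    {
        var seen = new HashSet<int>();
        for (int i = 0; i < ages.Count; ++i)
            seen.Add(ages[i]);
        return seen.ToList();
    }

    // Hypothetical helper: wall-clock time of one run, in milliseconds.
    public static double TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        // Assumed test data: 1,000,000 pseudo-random ages in [1, 105].
        var rng = new Random(42);
        var ages = Enumerable.Range(0, 1_000_000)
                             .Select(_ => rng.Next(1, 106))
                             .ToList();

        Console.WriteLine($"Contains + Add: {TimeMs(() => WithContains(ages)):F2} ms");
        Console.WriteLine($"Add only:       {TimeMs(() => AddOnly(ages)):F2} ms");
    }
}
```

For numbers you can rely on, a dedicated benchmarking tool with warm-up and repeated runs would be a better choice than a one-off measurement.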

NetMage
  • 26,163
  • 3
  • 34
  • 55
  • Once the number of elements gets up into the thousands, set based approaches are roughly two orders of magnitude faster than list based ones. Actually, `Distinct` doesn't use a `HashSet` but a special internal set and distinct enumerator. But if you want a set as the result, just calling `ToHashSet` is faster than a manual loop that adds to a `HashSet` and it does use a `HashSet` directly. – Aluan Haddad Sep 21 '20 at 23:51
  • @AluanHaddad Actually, it is not, because `ToHashSet` takes an `IEnumerable` and passes it to the `HashSet` constructor which has no optimizations for a `List`, so the `for`/`Contains` is much faster. In my testing, `for`-`Contains`-`Add` is 1.6x faster than `ToHashSet`. (PS I know `Distinct` uses an optimized set, but that seemed unnecessary technical detail.) – NetMage Sep 23 '20 at 00:22
  • In .NET 5-rc, the results are otherwise. – Aluan Haddad Sep 23 '20 at 15:21
  • @AluanHaddad I find that surprising given the code appears the same as .Net Core 3.1 and the overhead of what it is doing seems like it should be significant (`foreach` `AddIfNotPresent`). I'll see if I can test with .NET 5 RC1. – NetMage Sep 23 '20 at 17:32
  • I was testing with 10,000 pseudo randomly generated ints if that helps. – Aluan Haddad Sep 23 '20 at 22:02
  • @AluanHaddad I am testing with 10,000,000 random ints from 1 to 105 (ages). – NetMage Sep 23 '20 at 22:46
  • I tested with pseudo randoms in [1, 105] and my results were concordant with yours. My original test used pseudo randoms in [0, 2147483647). The number of collisions accounts for the difference I expect. – Aluan Haddad Sep 24 '20 at 01:53
0

This is the first solution:

    List<int> distinctAges = new List<int>() { 1, 2, 3, 4, 5, 6 };
    List<int> ages = new List<int>() { 1, 3, 7, 8, 9, 10 };

    distinctAges.AddRange(ages.Where(e => !distinctAges.Contains(e)).ToList());

Second solution:

    HashSet<int> distinctAges = new HashSet<int>() { 1, 2, 3, 4, 5, 6 };
    List<int> ages = new List<int>() { 1, 3, 7, 8, 9, 10 };

    distinctAges.UnionWith(ages);
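
The set does not have to be pre-filled. As a minimal sketch, assuming you start from an empty `HashSet<int>` and merge any number of lists into it (the helper name `DistinctAcross` and the sample values are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnionDemo
{
    // Hypothetical helper: collects the distinct values of all sources.
    public static List<int> DistinctAcross(params IEnumerable<int>[] sources)
    {
        var distinct = new HashSet<int>();   // starts empty
        foreach (var source in sources)
            distinct.UnionWith(source);      // adds only values not yet present
        return distinct.ToList();
    }

    public static void Main()
    {
        var first = new List<int> { 1, 3, 7 };
        var second = new List<int> { 3, 8, 2000 };

        var result = DistinctAcross(first, second);
        Console.WriteLine(result.Count);     // 5 distinct values: 1, 3, 7, 8, 2000
    }
}
```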
Issa Saman
  • 100
  • 9
  • Thanks for that, but I do not know which numbers will be in the initial distinctAges. They can range from 0 (a new item) to a really old item like 2000. So if I wanted to use your solution, I would have to create that distinctAges at the start. – Kopco Sep 22 '20 at 07:53
  • It may not have any items at all. I gave you an example where distinctAges already has data, but you can fill it from two, three, or four lists. – Issa Saman Sep 23 '20 at 17:10
-2

You can use Parallel.For or Parallel.ForEach loops. These are the multithreaded versions of their respective loops. For example, with a list of ages with 1 000 000 items, it took 77 ms with a for loop, 55 ms with a foreach loop, 24 ms with a Parallel.For loop, and 15 ms with a Parallel.ForEach loop.

LINQ also has a parallel alternative called PLINQ, but here results can vary because it won't always execute in parallel. In the same test as above it took 50 ms for the PLINQ query and 44 ms for the traditional LINQ query so in this case it was worse than the traditional LINQ query.

Here are the Parallel loops:

using System.Threading.Tasks;
//...

Parallel.For(0, ages.Count, indexInt =>
        {
            if (!distinct1.Contains(ages[(int)indexInt]))
                distinct1.Add(ages[(int)indexInt]);

        });

Parallel.ForEach(ages, singleAge =>
        {
            if (!distinct2.Contains(singleAge))
                distinct2.Add(singleAge);
        });

Their signatures are Parallel.For(int fromInclusive, int toExclusive, Action<int> body) and Parallel.ForEach<TSource>(IEnumerable<TSource> source, Action<TSource> body)

And here is the PLINQ example:

using System.Linq;
//...

List<int> distinct = ages.AsParallel().Distinct().ToList();

However, you must take care when using the Parallel loops, since they run on multiple threads. In the test above, though, they were faster than the for and foreach loops.
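
One point that needs care is the check-then-add pattern: `Contains` followed by `Add` on a plain `List<int>` is not atomic, so two threads can both pass the check before either adds. A minimal sketch of one way around this, assuming a `ConcurrentDictionary<int, byte>` is used as a concurrent set (the class and method names are made up for illustration, and this is not claimed to be faster than a sequential loop):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelDistinctDemo
{
    // Hypothetical helper: thread-safe distinct using TryAdd,
    // which checks and inserts in one atomic step.
    public static List<int> DistinctParallel(List<int> ages)
    {
        // The byte value is a placeholder; only the keys matter.
        var seen = new ConcurrentDictionary<int, byte>();

        Parallel.ForEach(ages, age =>
        {
            seen.TryAdd(age, 0);   // returns false if the key already exists
        });

        return seen.Keys.ToList();
    }

    public static void Main()
    {
        var ages = new List<int> { 30, 25, 30, 41, 25, 41, 30 };
        var distinct = DistinctParallel(ages);
        Console.WriteLine(distinct.Count);   // 3 distinct values: 25, 30, 41
    }
}
```

The order of the returned values is not guaranteed, and the synchronization inside the dictionary has its own cost, so it is worth measuring against the sequential versions on your own data.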

Astro Mec
  • 61
  • 1
  • 4
  • "But these are faster than for and foreach loops" - not necessarily. See https://stackoverflow.com/questions/6036120/parallel-foreach-slower-than-foreach. Multitasking does not come without a cost, which becomes more apparent with simpler tasks. – Broots Waymb Sep 21 '20 at 13:57
  • 2
    The above examples require a concurrent collection and the use of a method that simultaneously checks and adds the value if needed (`TryAdd`). Right now multiple threads can perform checks before another thread is able to add the value, which could cause duplicates. – NotFound Sep 21 '20 at 14:13
  • 1
    `if (!distinct1.Contains(...)) distinct1.Add` <== this is probably not thread-safe, and certainly not atomic. You are not supposed to do things like this in a parallel loop. – Theodor Zoulias Sep 21 '20 at 14:13
  • 2
    The PLINQ implementation is threadsafe, the others are not. – Aluan Haddad Sep 21 '20 at 14:30
  • I tried your solutions, but for was 20x faster. These lines are only part of the whole code and I actually want to change only parts of it. The rest is still the same in my runs. – Kopco Sep 22 '20 at 07:28