
I was reading this question about Lists and deferred execution.

SLaks mentioned you can replace the ToList() in this code with a Join() instead and it'd be much faster.

How can I change this code to use IEnumerable.Join()?

var dic = new Dictionary<int, string>();

for (int i = 0; i < 20000; i++)
{
    dic.Add(i, i.ToString());
}

var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList(); 

var list2 = dic.Where(f => list.Contains(f.Key)).ToList();
– David Klempfner (question edited by Dmitry Bychenko)
  • What he wrote is that instead of iterating over both lists to find matches, `Join` should be used to match records and still get deferred execution. Your code's problem is that it's using a `Dictionary` in the first place though. Iterating over a dictionary is a *lot* slower than iterating over an array/List. Searching is slower too, unless you use exact key lookups – Panagiotis Kanavos Jun 04 '21 at 10:47
  • What's the point of this code anyway? You've already found the matching items in the first query. `list2` will contain the `KeyValuePair` items found by the first query. `dic.Where(f => f.Value.StartsWith("1")).ToList();` would be enough – Panagiotis Kanavos Jun 04 '21 at 10:49
  • @PanagiotisKanavos - I think the point is to demonstrate the execution inefficiency of a double query. It's not about optimisation. – Enigmativity Jun 04 '21 at 10:56
  • @Enigmativity If it were about that, then a better example would have been far more appropriate. I'm leaning towards Panagiotis' opinion as being more relevant. – DavidG Jun 04 '21 at 10:58
  • @DavidG - Check the links in the question. The OP got the code from there. – Enigmativity Jun 04 '21 at 11:00
  • The example is bad. `Join` is faster because it builds a `HashSet` internally from one sequence and uses it to find matches from the other. That's what a Dictionary is for! – Panagiotis Kanavos Jun 04 '21 at 11:02
  • Yes, agreed the example is bad. – Enigmativity Jun 04 '21 at 11:04

3 Answers

1

You can change the code as shown below

var list = dic.Where(f => f.Value.StartsWith("1")).ToList(); 
var list2 = dic.Join(list, d=> d.Key, a => a.Key, (d, a) => new {d, a}).Select(m => m.d).ToList();
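
Side note, not required for correctness: the intermediate anonymous type can be dropped, since the result selector may return the outer element directly. A minimal equivalent, assuming the same `dic` and `list` as above:

// Equivalent, slightly shorter form: project the dictionary entry directly in the result selector.
var list2 = dic.Join(list, d => d.Key, a => a.Key, (d, a) => d).ToList();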

Yes, the Join would be faster than Contains.

To validate this, run the following program, which produces output like the following (times are in Stopwatch ticks):

  Test1- Item count: 1111
  Time to execute: 100542
  Test2- Item count: 1111
  Time to execute: 31441

Program:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
                    
public class Program
{
    
    private static void Benchmark(Action act, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            act.Invoke();
        }
        sw.Stop();
        Console.WriteLine("Time to execute: " + (sw.ElapsedTicks).ToString());
    }
    
    public static void Test1()
    {
        var dic = new Dictionary<int, string>();
        for(int i=0; i<2000; i++)
        {
            dic.Add(i, i.ToString());
        }

        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList(); 
        var list2 = dic.Where(f => list.Contains(f.Key)).ToList();
        
        Console.WriteLine("Test1- Item count: "+ list2.Count());
    }
    
    public static void Test2()
    {
        var dic = new Dictionary<int, string>();
        for(int i=0; i<2000; i++)
        {
            dic.Add(i, i.ToString());
        }
    
        var list = dic.Where(f => f.Value.StartsWith("1")).ToList(); 
        var list2 = dic.Join(list, d=> d.Key, a => a.Key, (d, a) => new {d, a}).Select(m => m.d).ToList();
        Console.WriteLine("Test2- Item count: "+ list2.Count());
    }

    public static void Main(string[] args)
    {
        Benchmark(()=> Test1(), 1);
        Benchmark(() => Test2(), 1);    
    }
}
– user1672994
0

The join is much faster because it will internally create a HashSet to perform the lookup, whereas list.Contains has to scan the list to find each match.
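
For intuition, the same shape can be written by hand with a `HashSet<int>`; this is only a sketch of the idea (the `keys` variable is a name introduced here, not a claim about the framework's exact internals), reusing the `dic` from the question:

// Build the set of matching keys once, then do O(1) lookups instead of scanning a list per entry.
var keys = new HashSet<int>(dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key));
var list2 = dic.Where(f => keys.Contains(f.Key)).ToList();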

The current code:

var sw = Stopwatch.StartNew();

var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList();

var list2 = dic.Where(f => list.Contains(f.Key)).ToList();

Console.WriteLine(sw.Elapsed.TotalMilliseconds);

Outputs about 32.4966 ms.

Whereas this:

var sw = Stopwatch.StartNew();

var query =
    from f in dic
    where f.Value.StartsWith("1")
    join x in dic on f.Key equals x.Key
    select x;

var list2 = query.ToList();

Console.WriteLine(sw.Elapsed.TotalMilliseconds);

Outputs about 2.4568 ms.

You still need a .ToList() to cause the query to execute.

The first approach runs in O(n^2) time and the second in O(n).
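
To see the deferred execution point concretely, here is a small sketch (it assumes the same `dic` as above; the extra entry and its value are made up purely for illustration):

// Defining the query does no work yet.
var query = dic.Where(f => f.Value.StartsWith("1"));

// An entry added after the query is defined...
dic.Add(20000, "1-added-later");

// ...is still picked up here, because the filter only runs when ToList() enumerates the query.
var results = query.ToList();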

– Enigmativity
  • The second query returns the same KVP items the first one did. `dic.Where(f => f.Value.StartsWith("1")).ToList()` would be a lot faster – Panagiotis Kanavos Jun 04 '21 at 10:51
  • @PanagiotisKanavos - Yes, but that's not the point of the question. The OP is purposely doing a double query. – Enigmativity Jun 04 '21 at 10:55
  • Besides, Join is faster because it builds a HashSet from one of the enumerables internally to avoid iterations for lookups. That's what the dictionary would do if used correctly – Panagiotis Kanavos Jun 04 '21 at 10:55
  • @PanagiotisKanavos - Yes, I think that is the point. – Enigmativity Jun 04 '21 at 10:57
  • `that's not the point of the question.` I suspect this is an XY question. Why iterate a dictionary at all? Why do it *twice*? It's strange asking how to speed things with `Join` when those iterations made the code far slower than it needs to be. – Panagiotis Kanavos Jun 04 '21 at 10:58
  • @PanagiotisKanavos - If you check the links in the question the OP shows where he got the original code from. This is an academic exercise. – Enigmativity Jun 04 '21 at 11:00
-1

It is simple, but I am not sure whether it will be faster; it needs to be tested:

var dic = new Dictionary<int, string>();

for (int i = 0; i < 20000; i++)
{
    dic.Add(i, i.ToString());
}

var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key); //.ToList();

var list2 = (from item in dic
            join listItem in list
                on item.Key equals listItem
            select item).ToList();
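
A quick way to sanity-check the claim (a rough sketch rather than a rigorous benchmark; it reuses the `dic` built above, and `keys`, `withContains` and `withJoin` are names introduced here; the key list is materialised up front so the Contains variant is not re-running the Where on every check):

// Build the key list once, outside the timed sections.
var keys = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList();

var sw = System.Diagnostics.Stopwatch.StartNew();
// Linear scan of keys for every dictionary entry.
var withContains = dic.Where(f => keys.Contains(f.Key)).ToList();
Console.WriteLine("Contains: " + sw.Elapsed.TotalMilliseconds + " ms");

sw.Restart();
// Join builds a hash-based lookup from keys and probes it once per entry.
var withJoin = (from item in dic
                join listItem in keys on item.Key equals listItem
                select item).ToList();
Console.WriteLine("Join: " + sw.Elapsed.TotalMilliseconds + " ms");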

Here I have run a profiler using the code below:

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    const int ResultLength = 11111;
    const int NElements = 20000;
    static readonly int[] RandomArrayOrderCache = RandomArrayOrder();
    static readonly Dictionary<int, string> dic = Enumerable.Range(0, NElements).ToDictionary(k => k, v => v.ToString());
    static readonly KeyValuePair<int, string>[] UnorderedList =
        dic.OrderBy((k) => RandomArrayOrderCache[k.Key]).ToArray();

    static int[] RandomArrayOrder()
    {
        var result = Enumerable.Range(0, NElements).ToArray();
        var random = new Random();
        for (int i = 0; i <= 10000000; i++)
        {
            var i1 = random.Next(0, result.Length - 1);
            var i2 = random.Next(0, result.Length - 1);
            var tmp = result[i1];
            result[i1] = result[i2];
            result[i2] = tmp;
        }

        return result;
    }

    public static List<KeyValuePair<int, string>> TestWithoutToListUsingJoin()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key);// .ToList();
        return (from item in dic
                join listItem in list
                    on item.Key equals listItem
                select item).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithToListUsingJoin()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList();
        return (from item in dic
                join listItem in list
                    on item.Key equals listItem
                select item).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithToListUsingJoinButRandomizeElementOrder()
    {
        var list = UnorderedList.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList();
        return (from item in dic
                join listItem in list
                    on item.Key equals listItem
                select item).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithoutToListUsingJoinButRandomizeElementOrder()
    {
        var list = UnorderedList.Where(f => f.Value.StartsWith("1")).Select(f => f.Key);
        return (from item in dic
                join listItem in list
                    on item.Key equals listItem
                select item).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithToListOldWay()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key).ToList();
        return dic.Where(f => list.Contains(f.Key)).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithoutToListOldWay()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Key);// .ToList();
        return dic.Where(f => list.Contains(f.Key)).ToList();
    }


    public static List<KeyValuePair<int, string>> TestWithoutToListIfItWasUsedWithAReferenceType()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Value);// .ToList();
        return (from item in dic
                join listItem in list
                    on item.Value equals listItem
                select item).ToList();
    }

    public static List<KeyValuePair<int, string>> TestWithToListIfItWasUsedWithAReferenceType()
    {
        var list = dic.Where(f => f.Value.StartsWith("1")).Select(f => f.Value).ToList();
        return (from item in dic
                join listItem in list
                    on item.Value equals listItem
                select item).ToList();
    }

    public static void Main(string[] args)
    {
        long dummy = 0;
        long numberOfRetries = 1000;
        for (int i = 0; i < numberOfRetries; ++i)
        {
            dummy += TestWithoutToListUsingJoin().Count;
            dummy += TestWithToListUsingJoin().Count;
            dummy += TestWithToListOldWay().Count;
            // This takes a ridiculous amount of time, so it is disabled here to compare the others better
            //dummy += TestWithoutToListOldWay().Count;

            // Added these two to compare the heap vs stack caching effects on times
            dummy += TestWithoutToListIfItWasUsedWithAReferenceType().Count;
            dummy += TestWithToListIfItWasUsedWithAReferenceType().Count;

            dummy += TestWithToListUsingJoinButRandomizeElementOrder().Count;
            dummy += TestWithoutToListUsingJoinButRandomizeElementOrder().Count;
        }
        Environment.Exit(dummy == ResultLength * 7 * numberOfRetries ? 0 : 1);
    }
}

Here are the results:

[Profiling results screenshot]

From the results we can see the reason why: the first list, the one used in the where clause, is a list of value-type items (ints), so it benefits greatly from the CPU cache. But when we use a reference type instead, with the instances spread across the heap, the benefit diminishes. We can see the same result with the randomly spread keys too. The performance gain from ToList is mostly that the ordered values sit in contiguous memory and therefore mostly hit the CPU cache.

– Abdurrahim (answer edited by E_net4)
  • So why not test it then? – Enigmativity Jun 04 '21 at 10:57
  • I do not understand why I got a downvote. This discourages me from taking the time to answer questions. Please, whoever downvoted, state your reason so I can understand and not repeat my mistake again! – Abdurrahim Jun 04 '21 at 11:14
  • I'm afraid you haven't tested what you thought you have tested. The join in both methods eliminates the inefficiency that the OP is asking about. You need to test with the original code in one method and a join in the other. – Enigmativity Jun 04 '21 at 11:34
  • Even with Join, ToList makes it faster due to caching, I believe, but fine, I will run it with the old methods again as well. I believe ToList in general makes the memory contiguous, so the caching is much more effective, especially when the List is a list of structs. If the first "list" were a list of strings or some other reference type, then it might be a much different story. Anyhow, it is running now; I will update the results. – Abdurrahim Jun 04 '21 at 11:38
  • No, the code with `.ToList()` is slower. I refactored your code so that it wasn't building the dictionary each time and the code with `.ToList()` was taking 2% longer than without. – Enigmativity Jun 04 '21 at 12:03
  • @Enigmativity please share the refactored code so I can see the mistakes I am making, thank you, – Abdurrahim Jun 04 '21 at 12:22
  • "The performance impact you gain using ToList is ordered values in contiguous memory hitting CPU cache mostly," - No, that's not true at all. `ToList()` does not do any of that. – Enigmativity Jun 04 '21 at 21:54
  • You've got loads of issues going on in your code. By doing `random.Next(0, result.Length - 1)` you are never swapping the last element of your array. You are also performing 10_000_000 swaps on a `20_000` element array - why? - just iterate through the array and swap each element once with a random element. – Enigmativity Jun 04 '21 at 22:00
  • Well, what does that have to do with the performance counters I am testing? It runs only once at the start, right, and I am only comparing the CPU time of the methods; skipping one element should be no big deal there. Also, how does ToList not generate contiguous memory? .NET is open source; just open the source code of List and see that it uses an array for storing items. – Abdurrahim Jun 04 '21 at 22:10
  • Also, feel free to correct these; you can edit the answer. – Abdurrahim Jun 04 '21 at 22:12
  • `List` is backed by an array which is contiguous, but my point is that contiguous memory is not providing any performance improvement. – Enigmativity Jun 04 '21 at 22:22
  • I used LINQPad so the `.Dump()` at the end is to display the results. You could easily write to the Console. – Enigmativity Jun 04 '21 at 23:15
  • The final result is that the old way without the `.ToList()` is slow slow slow. Then the old way with `.ToList()` much much faster. Then all the joins pretty much come in twice as fast again, but there is a bias towards the ordered list being slightly faster - which is because it is slower to build a `HashSet` on an unordered list and nothing to do with contiguous memory. There is negligible difference on value or reference types nor if there is a `.ToList()` on the join queries. – Enigmativity Jun 04 '21 at 23:18