2

How do I make ToLookup() concurrent? I have a some code like this:

myRepository.GetAllContacts().ToLookup( c => c.COMPANY_ID);

I would like to have a structure similar to this:

new ConcurrentDicitonary<String,IEnumerable<Contact>>();  // <Company,List of Contacts in company>

Because each COMPANY_ID could map to multiple Contact, I can't simply use ToDictionary.

Hamid Pourjam
  • 20,441
  • 9
  • 58
  • 74
Bill Software Engineer
  • 7,362
  • 23
  • 91
  • 174
  • Why concurrent? Are you hoping for a speed improvement? – Enigmativity Jan 10 '16 at 07:51
  • Yes, I would like to concurrently do some work on all Companies at the same time. – Bill Software Engineer Jan 10 '16 at 11:03
  • The only way that concurrency is going to give you any speed improvement is if the time taken to compute `c.COMPANY_ID` is lengthy. Since it is a property is probably lightning fast so the current code will perform far better than using any form of multi-threading. It just takes time to launch threads and to do all of the marshalling. – Enigmativity Jan 10 '16 at 11:07

1 Answers1

4

This seems like a simple question, but as the other (problematic) answers show, the solution is not really trivial.

Problems with existing answers

As it stands, both proposed solutions result in the creation of some kind of dictionary where each time you enumerate the IEnumerable<Contact> at any given key, the filtered IEnumerable<Contact> is recreated from scratch by enumerating and filtering the original collection. Essentially, what you're storing in the dictionary is the logic to get the desired filtered Contact collections, not the actual collections.

As a result you'll be enumerating the original IEnumerable<Contact> over and over. This is dangerous from a thread-safety standpoint and even if it works - there is no benefit to doing this, only overheads.

Proposed solution

You are right in that the best out-of-the-box thread-safe alternative for Lookup/ILookup<TKey, TValue> appears to be ConcurrentDictionary<TKey, TValue> where TValue derives from IEnumerable<Contact>. It offers a superset of lookup functionality and is thread-safe if you build it correctly. There is no ready-to-use extension method for this in the Base Class Library, so you can just roll your own implementation:

IEnumerable<Contact> contacts = GetAllContacts();
ConcurrentDictionary<string, IReadOnlyList<Contact>> dict = new ConcurrentDictionary<string, IReadOnlyList<Contact>>();

foreach (IGrouping<string, Contact> group in contacts.GroupBy(c => c.COMPANY_ID))
{
    if (!dict.TryAdd(group.Key, group.ToArray())) {
        throw new InvalidOperationException("Key already added.");
    }
}

This looks very similar to what others have offered, but with one important difference: my dictionary's TValue is a materialised collection (specifically Contact[] posing as IReadOnlyList<Contact>). It does not get rebuilt from scratch every time you pull it out of the dictionary and enumerate it.

Oh, and also I only enumerate the source IEnumerable<Contact> once, ever - not really life-changing, but a nice touch.

You could still use ConcurrentDictionary<string, IEnumerable<Contact>> as your dictionary type (you could substitute the dictionary type in my example above and it will still compile and work as expected) - just be sure that you only add materialised and, preferably, immutable collections to the dictionary as you're building it.

Choosing your TValue type: alternatives to IReadOnlyList<T>

(beyond the scope of the original question)

IReadOnlyList<T> is the most generic general-purpose quasi-immutable collection interface I could think of (apart from IReadOnlyCollection<T> obviously) that conveys to the callers that the collection is materialised and unlikely to change in the future.

If I were using this in my own code I would actually use Contact[] as my dictionary's TValue for any private and internal calls (forgoing the comfort of "read-only" for perf reasons). For any public APIs I would stick with IReadOnlyList<T> or possibly ReadOnlyCollection<T> to emphasise the read-only aspect of the TValue collection.

If taking an external dependency is a viable option, you could also add Microsoft's System.Collections.Immutable NuGet to your project and use an ImmutableDictionary<string, ImmutableArray<Contact>> to store your lookup. ImmutableDictionary<TKey, TValue> is an immutable thread-safe dictionary. ImmutableArray<T> is a lightweight array wrapper which has strong immutability guarantees and also has solid performance characteristics achieved through a struct enumerator and re-implementations of certain LINQ methods which avoid enumerator allocations altogether.

List<T> would be a poor choice for TValue due to it's a) mutability and b) its tendency to allocate internal buffer arrays whose length is greater than List<T>.Count (unless you explicitly use List<T>.TrimExcess). When you stash stuff in a dictionary, there's a good chance it'll stay alive for a while, so allocating memory you're not really going to use (like List<T> does) is not a great idea.

EDIT

Now after all this I have to add: .NET's current Lookup<Tkey, TValue> implementation returned by LINQ's ToLookup actually appears to be thread-safe. However, none of the specs that I found make any guarantees about the thread-safety of instance methods on Lookup<TKey, TValue> (MSDN specifically states that they are not guaranteed to be thread-safe), which means that lookup thread-safety is an implementation detail, not a bulletproof guarantee. therefore all I said above re using ConcurrentDictionary<TKey, TValue> still applies.

Kirill Shlenskiy
  • 9,367
  • 27
  • 39
  • VERY interesting reply. I am gonna have to try this next Monday at work. – Bill Software Engineer Jan 10 '16 at 11:04
  • +1 but you need to measure if this is faster. I'd say 95% chance of it being significantly slower. The internal coordination in the CD probably kills the parallelization gains. – usr Jan 10 '16 at 13:45
  • Also, the GroupBy must be parallelized as well in order to have even a chance of speeding this up. Otherwise you build a parallel dict *after building a sequential dict*. – usr Jan 10 '16 at 13:45
  • @usr, there is some ambiguity in the question leading us to focus on two different concepts: you're thinking about *building* the lookup in parallel, whereas my focus was on creating a thread-safe lookup which would subsequently be *used* by parallel consumers. Of course creating a `ConcurrentDictionary` will take longer - addressing that was never part of my goal due to the way I read the question. – Kirill Shlenskiy Jan 10 '16 at 13:53
  • I didn't really think of this ambiguity until I saw @Enigmativity's comment (which makes total sense and I, too, don't see much benefit in parallellising the lookup creation). – Kirill Shlenskiy Jan 10 '16 at 14:02
  • What ToLookup returns is already thread-safe for reading. To me it seems this is about creating the lookup. – usr Jan 10 '16 at 14:20
  • I didn't know that the result of ToLookUp was already thread-safe. In anycase, I tried it, it works, but it's actually slower than single-threaded. So I had to shelf the changes. – Bill Software Engineer Jan 11 '16 at 19:49
  • @YongkeBillYu, so was usr right that you were trying to *build* the lookup in parallel as opposed to using it in parallel once it's ready? – Kirill Shlenskiy Jan 11 '16 at 21:35
  • No, I was actually just trying to USE it in parallel. The building part didn't take that long so I didn't bother with it. It was the using part that was taking the majority of run time. – Bill Software Engineer Jan 11 '16 at 21:55
  • @YongkeBillYu, at least I got my assumptions right then. `ConcurrentDictionary` is pretty quick, so I'm very surprised re your perf results - they suggest that you're using the lookup it in a way I didn't anticipate. I'd be curious to see the parallel consumer code (if you are allowed to post it and think there's benefit to doing so, of course). – Kirill Shlenskiy Jan 11 '16 at 23:09
  • There isn't any point in posting the code, but to put it simply, I was just mapping each Company to a export view object stored in a list, then converting the list to CSV, then export out. Clearly, this can be done in parallel as they don't depend on each other, but I guess each one simply isn't taking long enough to justify the cost of parallelism. – Bill Software Engineer Jan 11 '16 at 23:24
  • @YongkeBillYu, your perf could be limited by IO of the export stage (in which case you'll get no benefit in parallelisation), but if the CPU work component is non-negligible, you could benefit from a `Parallel.ForEach` with a limited `DegreeOfParallelism` over your `ILookup`. This does not require concurrent access to the lookup though, so my above answer kinda goes out the window. Matter of fact in this situation I wouldn't even use a lookup - `Parallel.ForEach(contacts.GroupBy(c => c.COMPANY_ID), ...)` will get the job done - if there's benefit to parallelisation it at all of course. – Kirill Shlenskiy Jan 11 '16 at 23:38