
I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).

The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.

I started with a list and identified the appropriate token object to update using a LINQ query.

public class Model {
    public List<State> States { get; set; }
    ...
}

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.

Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense; lists are an inefficient way to search).

So, I switched the list over to a HashSet and profiled again, confident that the HashSet would be able to seek to an entry faster. This time I was only processing ~900 tokens/sec, and a near-identical amount of time was spent on the LINQ (89%).

So... First up, am I misusing the HashSet? (Is using LINQ forcing a conversion to IEnumerable and then an enumeration, or something similar?)

If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a binary search, so I assume I'd need to build some sort of tree structure and have smaller sub-sets?

To answer some questions from the comments: the condition is unique (if I get the same token twice, I want to update the same entry), and the HashSet is the stock .NET implementation (System.Collections.Generic.HashSet<T>).

A wider view of the code is...

        var state = new RollingList(model.StateDepth); // Tracks the last n items and drops older ones (basically an array and an index that wraps around).
        var tokens = tokeniser.Tokenise(contents); // Iterator
        foreach (var token in tokens) {
            var stateText = StateToString(ref state);
            var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
            // ... update the match as appropriate for the token
        }
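
For reference, a minimal sketch of what a RollingList like the one above might look like (hypothetical; the question doesn't include the real implementation):

using System.Collections.Generic;

// Hypothetical sketch of RollingList: a fixed-size ring buffer that keeps
// the last n items, overwriting the oldest as new ones arrive.
public class RollingList {
    private readonly string[] _items;
    private int _next;                          // slot the next Add writes to
    public int Count { get; private set; }

    public RollingList(int depth) {
        _items = new string[depth];
    }

    public void Add(string item) {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;    // wrap around
        if (Count < _items.Length) Count++;
    }

    // Enumerates the held items from oldest to newest.
    public IEnumerable<string> Items() {
        int start = Count < _items.Length ? 0 : _next;
        for (int i = 0; i < Count; i++)
            yield return _items[(start + i) % _items.Length];
    }
}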
Basic

2 Answers

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
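
To make the distinction concrete, here is a small illustration (the variable names are mine; State is the question's type):

var states = new HashSet<State>();

// O(1): hashes the element and jumps straight to its bucket.
bool known = states.Contains(someState);    // someState: a hypothetical State instance

// O(n): LINQ simply enumerates every element and tests the predicate;
// the hash table is never consulted.
var match = states.Where(x => x.Condition == stateText).SingleOrDefault();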

Possible data structures that meet your needs:

  • Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.

  • Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting. (Both options are sketched after this list.)
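
A minimal sketch of both options, reusing the question's State type (the lookup-or-create shape and the variable names are mine):

var byText = new Dictionary<string, State>();             // O(1) expected: hash table
var byTextSorted = new SortedDictionary<string, State>(); // O(log n): balanced tree

// Per-token lookup-or-create against the hash-based option
// (assumes State.Condition is settable, as in the question's Model):
if (!byText.TryGetValue(stateText, out var match)) {
    match = new State { Condition = stateText };
    byText.Add(stateText, match);
}
// ... update `match` as appropriate for the token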

Eric Lippert
  • In case you're interested, I've managed to get an order of magnitude improvement in speed with your advice, using an OrderedDictionary (~9,960 tokens/sec). The bottleneck is now elsewhere. Thanks for taking the time. – Basic Feb 24 '16 at 02:33
  • Nice! Happy to help. – Eric Lippert Feb 24 '16 at 02:37

30k is not that much, so if the state is unique you can do something like this; dictionary access is much faster.

var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);

Quoting MSDN:

The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.

You can find more info about dictionaries here. Also be aware that dictionaries trade memory space for performance; you can do a quick test for 300k items to see what kind of space I'm talking about, like this:

var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
Diogo Cunha
  • The problem will never be the search time; the only problem with dictionaries is that they consume a lot of memory. They trade memory space for performance. You can do a quick test to see the total memory that a dictionary with 300k items consumes, doing something like what I've edited into my answer. – Diogo Cunha Feb 23 '16 at 23:32
  • You could save a lookup by using [`TryGetValue`](https://msdn.microsoft.com/en-us/library/bb347013(v=vs.110).aspx) instead of contains + get. – Johnbot Feb 24 '16 at 10:21
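
For reference, the single-lookup shape Johnbot describes would look something like this, reusing the statesDic from the answer above:

// Single hash lookup instead of ContainsKey + indexer:
State match;
statesDic.TryGetValue(stateText, out match); // match is null (default) when the key is absent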