
I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).

The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.

I started with a list and identified the appropriate token object to update using a LINQ query.

public class Model {
    public List<State> States { get; set; }
    ...
}

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.

Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense; lists are an inefficient way to search).

So, I switched the list over to a HashSet and profiled again, confident that the HashSet would be able to seek to an entry faster. This time I was only processing ~900 tokens/sec, and a near-identical amount of time was spent on the LINQ (89%).

So... First up, am I misusing the HashSet? (Is using LINQ forcing a conversion to IEnumerable and then an enumeration, or something similar?)

If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a binary search, so I assume I'd need to build some sort of tree structure and have smaller sub-sets?

To answer some questions from the comments: the condition is unique (if I get the same token twice, I want to update the same entry), and the HashSet is the stock .NET implementation (System.Collections.Generic.HashSet<T>).

A wider view of the code is...

        var state = new RollingList(model.StateDepth); // Tracks the last n items and drops older ones (basically an array and an index that wraps around).
        var tokens = tokeniser.Tokenise(contents); // Iterator
        foreach (var token in tokens) {
            var stateText = StateToString(ref state);
            var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
            // ... update the match as appropriate for the token
        }
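
For reference, a minimal sketch of what a RollingList like the one above might look like (hypothetical; the question doesn't include the real implementation):

using System.Collections.Generic;

// Hypothetical sketch of RollingList: a fixed-size ring buffer that keeps
// the last n items, overwriting the oldest as new ones arrive.
public class RollingList {
    private readonly string[] _items;
    private int _next;                          // slot the next Add writes to
    public int Count { get; private set; }

    public RollingList(int depth) {
        _items = new string[depth];
    }

    public void Add(string item) {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;    // wrap around
        if (Count < _items.Length) Count++;
    }

    // Enumerates the held items from oldest to newest.
    public IEnumerable<string> Items() {
        int start = Count < _items.Length ? 0 : _next;
        for (int i = 0; i < Count; i++)
            yield return _items[(start + i) % _items.Length];
    }
}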
Basic

2 Answers

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
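
To make the distinction concrete, here is a small illustration (the variable names are mine; State is the question's type):

var states = new HashSet<State>();

// O(1): hashes the element and jumps straight to its bucket.
bool known = states.Contains(someState);    // someState: a hypothetical State instance

// O(n): LINQ simply enumerates every element and tests the predicate;
// the hash table is never consulted.
var match = states.Where(x => x.Condition == stateText).SingleOrDefault();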

Possible data structures that meet your needs:

  • Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.

  • Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting. (Both options are sketched after this list.)
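
A minimal sketch of both options, reusing the question's State type (the lookup-or-create shape and the variable names are mine):

var byText = new Dictionary<string, State>();             // O(1) expected: hash table
var byTextSorted = new SortedDictionary<string, State>(); // O(log n): balanced tree

// Per-token lookup-or-create against the hash-based option
// (assumes State.Condition is settable, as in the question's Model):
if (!byText.TryGetValue(stateText, out var match)) {
    match = new State { Condition = stateText };
    byText.Add(stateText, match);
}
// ... update `match` as appropriate for the token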

Eric Lippert
  • In case you're interested, I've managed to get an order of magnitude improvement in speed with your advice, using an OrderedDictionary (~9,960 tokens/sec). The bottleneck is now elsewhere. Thanks for taking the time. – Basic Feb 24 '16 at 02:33
  • Nice! Happy to help. – Eric Lippert Feb 24 '16 at 02:37

30k is not that much, so if the state is unique you can do something like this; dictionary access is much faster.

var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);

Quoting MSDN:

The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.

You can find more info about dictionaries here. Also be aware that dictionaries trade memory space for performance; you can do a quick test for 300k items to see what kind of space I'm talking about, like this:

var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
Diogo Cunha
  • The problem will never be the search time; the only problem with dictionaries is that they consume a lot of memory. They trade memory space for performance. You can do a quick test to see the total memory that a dictionary with 300k items consumes, doing something like what I've edited into my answer. – Diogo Cunha Feb 23 '16 at 23:32
  • You could save a lookup by using [`TryGetValue`](https://msdn.microsoft.com/en-us/library/bb347013(v=vs.110).aspx) instead of contains + get. – Johnbot Feb 24 '16 at 10:21
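
For reference, the single-lookup shape Johnbot describes would look something like this, reusing the statesDic from the answer above:

// Single hash lookup instead of ContainsKey + indexer:
State match;
statesDic.TryGetValue(stateText, out match); // match is null (default) when the key is absent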