2

I have a C# Dictionary<DateTime, SomeObject> instance.

I have the following code:

private Dictionary<DateTime, SomeObject> _containedObjects = ...;//Let's imagine you have ~4000 items in it

public IEnumerable<SomeObject> GetItemsList(HashSet<DateTime> requiredTimestamps){
    //How can I return the SomeObject values stored in _containedObjects for these keys,
    //knowing that rarely (in fewer than ~5% of calls) one or more of the "requiredTimestamps" may be missing from _containedObjects?
}

I'm looking for a way to return an IEnumerable<SomeObject> containing every element referenced by one of the provided keys. The catch is that this method will be called very often, and the dictionary might not contain every key passed in.

So is there something more efficient than this:

private Dictionary<DateTime, SomeObject> _containedObjects = ...;//Let's imagine you have ~4000 items in it

public IEnumerable<SomeObject> GetItemsList(HashSet<DateTime> requiredTimestamps){
    List<SomeObject> toReturn = new List<SomeObject>();
    foreach(DateTime dateTime in requiredTimestamps){
        SomeObject found;
        if(_containedObjects.TryGetValue(dateTime, out found)){
            toReturn.Add(found);
        }
    }
    return toReturn;
}
J4N
  • Do you always need all results in the returned `IEnumerable`? Otherwise you could use a `yield` construct to calculate the results lazily when needed. That would shave off some of the load. – Jan Thomä Apr 05 '16 at 14:38

4 Answers

2

In general, there are two ways you can do this:

  1. Go through requiredTimestamps sequentially and look up each date/time stamp in the dictionary. Dictionary lookup is O(1), so if there are k items to look up, it will take O(k) time.
  2. Go through the dictionary sequentially and extract those with matching keys in the requiredTimestamps hash set. This will take O(n) time, where n is the number of items in the dictionary.

In theory, the first option--which is what you currently have--will be the fastest way to do it.

In practice, it's likely that the first one will be more efficient when the number of items you're looking up is less than some percentage of the total number of items in the dictionary. That is, if you're looking up 100 keys in a dictionary of a million, the first option will almost certainly be faster. If you're looking up 500,000 keys in a dictionary of a million, the second method might be faster because it's a whole lot faster to move to the next key than it is to do a lookup.

You'll probably want to optimize for the most common case, which I suspect is looking up a relatively small percentage of keys. In that case, the method you describe is almost certainly the best approach. But the only way to know for sure is to measure.
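The two strategies above can be sketched side by side. This is only an illustration, not a benchmark: `LookupSketch`, `ByLookup`, and `ByScan` are hypothetical names, and `string` stands in for `SomeObject`.

```csharp
using System;
using System.Collections.Generic;

static class LookupSketch
{
    // Option 1: one dictionary lookup per requested key, O(k).
    public static List<string> ByLookup(Dictionary<DateTime, string> source, HashSet<DateTime> required)
    {
        var result = new List<string>(required.Count);
        foreach (DateTime key in required)
        {
            string found;
            if (source.TryGetValue(key, out found))
            {
                result.Add(found);
            }
        }
        return result;
    }

    // Option 2: walk every dictionary entry and test membership in the set, O(n).
    public static List<string> ByScan(Dictionary<DateTime, string> source, HashSet<DateTime> required)
    {
        var result = new List<string>();
        foreach (KeyValuePair<DateTime, string> pair in source)
        {
            if (required.Contains(pair.Key))
            {
                result.Add(pair.Value);
            }
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        var day = new DateTime(2016, 4, 5);
        var source = new Dictionary<DateTime, string>();
        for (int i = 0; i < 10; i++) source[day.AddDays(i)] = "Element " + i;

        // One key is present and one is missing; both strategies skip the missing one.
        var required = new HashSet<DateTime> { day.AddDays(3), day.AddDays(100) };

        Console.WriteLine(LookupSketch.ByLookup(source, required).Count);
        Console.WriteLine(LookupSketch.ByScan(source, required).Count);
    }
}
```

Both methods return the same elements (possibly in a different order), so timing the two against realistic sizes of `source` and `required` is a fair way to find where the crossover lies.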

One optimization you might consider is pre-sizing the output list. That will avoid re-allocations. So when you create your toReturn list:

List<SomeObject> toReturn = new List<SomeObject>(requiredTimestamps.Count);
Jim Mischel
1

You can use LINQ, but I doubt it will improve performance; if there is any difference, it should be negligible.

Your method could be:

public IEnumerable<SomeObject> GetItemsList(HashSet<DateTime> requiredTimestamps)
{
    return _containedObjects.Where(r => requiredTimestamps.Contains(r.Key))
                            .Select(d => d.Value);
}

One advantage of this is lazy evaluation: no list is populated up front, and the results are produced as the caller enumerates them.
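Lazy evaluation also means the query runs each time the result is enumerated, against whatever the dictionary contains at that moment. A small sketch of that behaviour, with hypothetical names (`LazySketch`, `Query`), the dictionary passed in as a parameter, and `string` standing in for `SomeObject`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LazySketch
{
    // Same shape as the LINQ method above, with the dictionary passed in explicitly.
    public static IEnumerable<string> Query(Dictionary<DateTime, string> source, HashSet<DateTime> required)
    {
        return source.Where(r => required.Contains(r.Key))
                     .Select(d => d.Value);
    }
}

static class Program
{
    static void Main()
    {
        var day = new DateTime(2016, 4, 5);
        var source = new Dictionary<DateTime, string> { { day, "first" } };
        var required = new HashSet<DateTime> { day, day.AddDays(1) };

        // Nothing is enumerated here; this only builds the query.
        IEnumerable<string> lazy = LazySketch.Query(source, required);

        // Added after the query was created, but before it is enumerated.
        source[day.AddDays(1)] = "second";

        // Deferred execution: the count includes the entry added above.
        Console.WriteLine(lazy.Count());

        // ToList() takes a snapshot when a stable result is needed.
        List<string> snapshot = lazy.ToList();
        source.Remove(day);

        // The snapshot is unaffected by the removal.
        Console.WriteLine(snapshot.Count);
    }
}
```

If callers enumerate the result more than once, or the dictionary can change in between, materializing with `ToList()` avoids repeating the scan and seeing different results.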

Habib
  • It's much more readable but OP is asking for better performance, I strongly doubt it's more efficient than its original version... – Adriano Repetti Apr 05 '16 at 14:40
  • @AdrianoRepetti, I agree, there shouldn't be any performance difference and even if there is, it should be negligible. – Habib Apr 05 '16 at 14:41
  • Well not so negligible (IMO), especially if requiredTimestamps is a small subset of _containedObjects but yes, for 4k objects I think it's even hard to measure this – Adriano Repetti Apr 05 '16 at 14:42
  • Well, it should perform better because in OP's code List toReturn is re-sized many times and it is quite costly performance-wise – Fabjan Apr 05 '16 at 15:00
  • I have the feeling that this doesn't use the dictionary for accessing values (we go through all the values even when requiredTimestamps is only a subset of all timestamps). – J4N Apr 06 '16 at 04:44
  • @Fabjan What if I create the List with the size of `_containedObjects`? It's not a big deal if half of the List is empty. – J4N Apr 06 '16 at 04:45
  • @J4n Yes, it should help too. – Fabjan Apr 06 '16 at 07:01
1

Method 1: You can make this significantly faster, not by changing the algorithm, but by copying the _containedObjects reference into a local variable in your method and using that local for the lookups.

Example:

public static IEnumerable<int> GetItemsList3(HashSet<DateTime> requiredTimestamps)
{
    var tmp = _containedObjects;

    List<int> toReturn = new List<int>();
    foreach (DateTime dateTime in requiredTimestamps)
    {
        int found;

        if (tmp.TryGetValue(dateTime, out found))
        {
            toReturn.Add(found);
        }
    }
    return toReturn;
}

Test data and times (on set of 5000 items with 125 keys found):
Your original method (milliseconds): 2.06032186895335
Method 1 (milliseconds): 0.53549626223609

Method 2: One way to make this marginally quicker is to iterate through the smaller set and do the lookup on the bigger set. Depending on the size difference you will gain some speed.

You are using a Dictionary and HashSet, so your lookup on either of these will be O(1).

Example: If _containedObjects has fewer items than requiredTimestamps, we loop through _containedObjects (otherwise, use your original method).

public static IEnumerable<int> GetItemsList2(HashSet<DateTime> requiredTimestamps)
{
    List<int> toReturn = new List<int>();
    foreach (var pair in _containedObjects)
    {
        // Keep the value only if its key was requested.
        if (requiredTimestamps.Contains(pair.Key))
        {
            toReturn.Add(pair.Value);
        }
    }
    return toReturn;
}

Test data and times (on set of 5000 for _containedObjects and set of 10000 items for requiredTimestamps with 125 keys found):
Your original method (milliseconds): 3.88056291367086
Method 2 (milliseconds): 3.31025939438943
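For anyone who wants to reproduce timings like these, here is a minimal sketch of one way to take them, assuming a `Stopwatch`-based measurement. `TimingSketch`, `TicksToMilliseconds`, and `MeasureMilliseconds` are hypothetical helper names, and the test data only mimics the sizes quoted above (5000 items, 125 keys found).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class TimingSketch
{
    // Converts raw Stopwatch ticks to milliseconds using the timer frequency (ticks per second).
    public static double TicksToMilliseconds(long ticks)
    {
        return (double)ticks / Stopwatch.Frequency * 1000.0;
    }

    // Times a single invocation of the given action.
    public static double MeasureMilliseconds(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return TicksToMilliseconds(stopwatch.ElapsedTicks);
    }
}

static class Program
{
    static void Main()
    {
        var day = new DateTime(2016, 4, 5);
        var source = new Dictionary<DateTime, int>();
        for (int i = 0; i < 5000; i++) source[day.AddDays(i)] = i;

        // 125 keys, all of which exist in the dictionary.
        var required = new HashSet<DateTime>();
        for (int i = 0; i < 125; i++) required.Add(day.AddDays(i * 40));

        double elapsed = TimingSketch.MeasureMilliseconds(() =>
        {
            var result = new List<int>(required.Count);
            foreach (DateTime key in required)
            {
                int found;
                if (source.TryGetValue(key, out found)) result.Add(found);
            }
        });

        Console.WriteLine("Elapsed (ms): " + elapsed);
    }
}
```

A single run like this includes JIT compilation and is noisy at sub-millisecond scale, so it is worth calling the method once as a warm-up and averaging over many iterations before comparing variants.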

Antony
  • I'm not sure I understand why copying the reference to your `var tmp` would be faster. We just copied a reference, not the whole array. (Regarding method 2, `_containedObjects` should always be much bigger than the `requiredTimestamps` hashset.) – J4N Apr 12 '16 at 10:34
  • @J4N There is a difference when referencing the stack and heap - local variables/references are on the stack which makes accessing them much quicker. (For method 2: In your case you wouldn't use it then) – Antony Apr 12 '16 at 16:43
  • I didn't know that at all! And I didn't think it would have that much impact! (It has less impact when I have a bigger dictionary, but in my case it helps a lot.) Thank you very much. What do you use to measure with that precision? – J4N Apr 13 '16 at 06:43
  • I use a [Stopwatch](https://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396). I use the number of ticks in the Stopwatch to calculate the values, e.g. `double ticks = stopWatch.ElapsedTicks;` and then the time `double milliseconds = (ticks / Stopwatch.Frequency) * 1000;` or `double nanoseconds = (ticks / Stopwatch.Frequency) * 1000000000;`. – Antony Apr 13 '16 at 11:09
  • I did the same without the ElapsedTicks ;) Thank you – J4N Apr 13 '16 at 14:22
0

Here are some different ways to do it - performance is all pretty much the same so you can choose based on readability.

Paste this into LinqPad if you want to test it out - otherwise just harvest whatever code you need.

I think my personal favourite from a readability point of view is method 3. Method 4 is certainly readable but has the unpleasant feature that it does two lookups into the dictionary for every required timestamp.

void Main()
{
    var obj = new TestClass<string>(i => string.Format("Element {0}", i));

    var sampleDateTimes = new HashSet<DateTime>();
    for(int i = 0; i < 4000 / 20; i++)
    {
        sampleDateTimes.Add(DateTime.Today.AddDays(i * -5));
    }
    var result = obj.GetItemsList_3(sampleDateTimes);
    foreach (var item in result)
    {
        Console.WriteLine(item);
    }
}

class TestClass<SomeObject>
{
    private Dictionary<DateTime, SomeObject> _containedObjects;

    public TestClass(Func<int, SomeObject> converter)
    {
        _containedObjects = new Dictionary<DateTime, SomeObject>();
        for(int i = 0; i < 4000; i++)
        {
            _containedObjects.Add(DateTime.Today.AddDays(-i), converter(i));
        }
    }

    public IEnumerable<SomeObject> GetItemsList_1(HashSet<DateTime> requiredTimestamps)
    {
        List<SomeObject> toReturn = new List<SomeObject>();
        foreach(DateTime dateTime in requiredTimestamps)
        {
            SomeObject found;
            if(_containedObjects.TryGetValue(dateTime, out found))
            {
                toReturn.Add(found);
            }
        }
        return toReturn;
    }

    public IEnumerable<SomeObject> GetItemsList_2(HashSet<DateTime> requiredTimestamps)
    {
        foreach(DateTime dateTime in requiredTimestamps)
        {
            SomeObject found;
            if(_containedObjects.TryGetValue(dateTime, out found))
            {
                yield return found;
            }
        }
    }    

    public IEnumerable<SomeObject> GetItemsList_3(HashSet<DateTime> requiredTimestamps)
    {
        return requiredTimestamps
            .Intersect(_containedObjects.Keys)
            .Select (k => _containedObjects[k]);
    }

    public IEnumerable<SomeObject> GetItemsList_4(HashSet<DateTime> requiredTimestamps)
    {
        return requiredTimestamps
            .Where(dt => _containedObjects.ContainsKey(dt))
            .Select (dt => _containedObjects[dt]);
    }
}
Richard Irons