24

Suppose `items` is the result of a LINQ expression:

var items = from item in ItemsSource.RetrieveItems()
            where ...

Suppose that generating each item takes a non-negligible amount of time.

Two modes of operation are possible:

  1. Using foreach would allow us to start working with items at the beginning of the collection much sooner than those at the end become available. However, if we later wanted to process the same collection again, we would have to save a copy of it:

    var storedItems = new List<Item>();
    foreach(var item in items)
    {
        Process(item);
        storedItems.Add(item);
    }
    
    // Later
    foreach(var item in storedItems)
    {
        ProcessMore(item);
    }
    

    Because if we just did foreach (... in items) again, ItemsSource.RetrieveItems() would get called a second time.

  2. We could call .ToList() right up front, but that would force us to wait for the last item to be retrieved before we could start processing the first one.
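
    For illustration, this mode is simply (using only the names from the snippets above):

    var storedItems = items.ToList(); // blocks until the last item is retrieved

    foreach (var item in storedItems)
    {
        Process(item);
    }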

Question: Is there an IEnumerable implementation that would iterate like a regular LINQ query result the first time, but would materialize along the way, so that a second foreach would iterate over the stored values?

Theodor Zoulias
zzandy
  • How hard could it be to write a CachingEnumerable/CachingEnumerator implementation that takes in the original IEnumerable, where the enumerator would cycle over the cache and then pull additional values from the original until done, caching as it goes through it? But no, I'm not aware of any framework implementation that does this. – Rich Sep 14 '12 at 15:05
  • @Rich: Probably not too hard, just wanted to check if there is one already. – zzandy Sep 14 '12 at 15:07
  • Well, the problem is that `foreach` actually works with its own `IEnumerator` and that `IEnumerator` has its own state. Sure, you could wrap an `IQueryable` or an `IEnumerable` in something that caches; but, you'd have to deal with the possibility of two `IEnumerator`s enumerating concurrently and at different rates. – Peter Ritchie Sep 14 '12 at 21:18
  • This is the typical case where I would have stopped using LINQ in favor of a standard loop :) The challenge is very interesting anyway. – Larry Sep 17 '12 at 09:39
  • Possible duplicate of [Caching IEnumerable](http://stackoverflow.com/questions/1537043/caching-ienumerable) – hazzik Jan 07 '16 at 07:40

4 Answers

13

A fun challenge, so I have to provide my own solution. So fun, in fact, that my solution is now in version 3. Version 2 was a simplification I made based on feedback from Servy. I then realized that my solution had a huge drawback: if the first enumeration of the cached enumerable didn't complete, no caching would be done. Many LINQ extensions like First and Take only enumerate enough of the enumerable to get the job done, so I had to update to version 3 to make this work with caching.

The question is about subsequent enumerations of the enumerable, which does not involve concurrent access. Nevertheless, I have decided to make my solution thread-safe. It adds some complexity and a bit of overhead, but it should allow the solution to be used in all scenarios.

using System;
using System.Collections;
using System.Collections.Generic;

public static class EnumerableExtensions {

  public static IEnumerable<T> Cached<T>(this IEnumerable<T> source) {
    if (source == null)
      throw new ArgumentNullException("source");
    return new CachedEnumerable<T>(source);
  }

}

class CachedEnumerable<T> : IEnumerable<T> {

  readonly Object gate = new Object();

  readonly IEnumerable<T> source;

  readonly List<T> cache = new List<T>();

  IEnumerator<T> enumerator;

  bool isCacheComplete;

  public CachedEnumerable(IEnumerable<T> source) {
    this.source = source;
  }

  public IEnumerator<T> GetEnumerator() {
    lock (this.gate) {
      if (this.isCacheComplete)
        return this.cache.GetEnumerator();
      if (this.enumerator == null)
        this.enumerator = source.GetEnumerator();
    }
    return GetCacheBuildingEnumerator();
  }

  public IEnumerator<T> GetCacheBuildingEnumerator() {
    var index = 0;
    T item;
    while (TryGetItem(index, out item)) {
      yield return item;
      index += 1;
    }
  }

  bool TryGetItem(Int32 index, out T item) {
    lock (this.gate) {
      if (!IsItemInCache(index)) {
        // The iteration may have completed while waiting for the lock.
        if (this.isCacheComplete) {
          item = default(T);
          return false;
        }
        if (!this.enumerator.MoveNext()) {
          item = default(T);
          this.isCacheComplete = true;
          this.enumerator.Dispose();
          return false;
        }
        this.cache.Add(this.enumerator.Current);
      }
      item = this.cache[index];
      return true;
    }
  }

  bool IsItemInCache(Int32 index) {
    return index < this.cache.Count;
  }

  IEnumerator IEnumerable.GetEnumerator() {
    return GetEnumerator();
  }

}

The extension is used like this (`sequence` is an `IEnumerable<T>`):

var cachedSequence = sequence.Cached();

// Pulling 2 items from the sequence.
foreach (var item in cachedSequence.Take(2)) {
  // ...
}

// Pulling 2 items from the cache and the rest from the source.
foreach (var item in cachedSequence) {
  // ...
}

// Pulling all items from the cache.
foreach (var item in cachedSequence) {
  // ...
}

There is a slight leak if only part of the enumerable is enumerated (e.g. `cachedSequence.Take(2).ToList()`). The enumerator used by `ToList` will be disposed, but the underlying source enumerator is not, because the first 2 items are cached and the source enumerator is kept alive in case subsequent items are requested. In that case the source enumerator is only cleaned up when it becomes eligible for garbage collection (which will be at the same time as the possibly large cache).
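
One possible mitigation is to give the wrapper a way to release the source enumerator deterministically. A sketch (assuming `CachedEnumerable<T>` is changed to also implement `IDisposable`; this is not part of the class above):

public void Dispose() {
  lock (this.gate) {
    if (this.enumerator != null) {
      this.enumerator.Dispose();
      this.enumerator = null;
      // Mark the cache complete so later enumerations replay only the
      // items cached so far instead of touching the disposed enumerator.
      this.isCacheComplete = true;
    }
  }
}

The trade-off is that disposing an incompletely built cache means subsequent enumerations see a truncated sequence.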

Martin Liversage
  • This would be much shorter/simpler if you used an iterator block to implement the `IEnumerator`. It would get rid of a lot of that boilerplate code. – Servy Sep 14 '12 at 17:57
  • @Servy: I have updated the code based on your input and I think it is a nice simplification. – Martin Liversage Sep 14 '12 at 19:46
  • It looks much nicer. Now you just need to allow for multithreaded (cached) iteration like my answer ;) – Servy Sep 14 '12 at 19:51
  • @Servy: As I see it multi-threading is a completely different problem which requires a different solution (which you seem to have provided). My solution solves the problem where you want to call `ToList` to avoid reiterating the enumerable but you still want the laziness of `IEnumerable`. – Martin Liversage Sep 14 '12 at 19:56
  • Actually, no. My code, in a previous iteration, did pretty much the same thing that yours did, I just added onto it. If you look at my solution and ignore the `ensureItemAt` method it is structured pretty close to the same way. As for the problems they solve, they both solve the same problem, it's just that yours adds the additional restriction of "you must iterate the entire enumeration in a single thread before any cached value will be used". Mine will re-used cached values even if the entire enumeration isn't iterated or if there is concurrent iteration. – Servy Sep 14 '12 at 20:00
  • 2
    Abandoning `IDisposable` objects is icky, though I guess since there's no telling whether there are ever going to be future calls to `GetEnumerator` there's probably no good way to know when the enumerator may be safely disposed. Too bad there's no concept of a disposable enumerable. – supercat Sep 14 '12 at 21:55
  • One point: it calls both `MoveNext()` and `Current` on the source enumerator even if the caller only calls `MoveNext()`. All yield-implemented IEnumerables internally compute the result for every `MoveNext` anyway, but I can imagine situations and implementations where the source `IEnumerable.Current` does additional work (lazy principle) and `MoveNext()` is fast; in that case it would be better to cache only when `Current` is called. – Кое Кто Oct 23 '18 at 11:45
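
The last comment suggests caching only when `Current` is accessed. A rough sketch of that idea (`LazyCurrentEnumerator` is a hypothetical name; it memoizes the source's `Current` on demand rather than eagerly):

using System.Collections;
using System.Collections.Generic;

class LazyCurrentEnumerator<T> : IEnumerator<T> {
  readonly IEnumerator<T> source;
  bool hasCurrent;
  T current;

  public LazyCurrentEnumerator(IEnumerator<T> source) {
    this.source = source;
  }

  public bool MoveNext() {
    this.hasCurrent = false;       // invalidate the memoized value
    return this.source.MoveNext(); // source.Current is not evaluated here
  }

  public T Current {
    get {
      if (!this.hasCurrent) {
        // Evaluate (and memoize) the source's Current only on demand.
        this.current = this.source.Current;
        this.hasCurrent = true;
      }
      return this.current;
    }
  }

  object IEnumerator.Current { get { return Current; } }

  public void Reset() { this.source.Reset(); }

  public void Dispose() { this.source.Dispose(); }
}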
8

Take a look at the Reactive Extensions library - there is a MemoizeAll() extension which will cache the items in your IEnumerable once they're accessed, and store them for future accesses.

See this blog post by Bart De Smet for a good read on MemoizeAll and other Rx methods.

Edit: This is actually found in the separate Interactive Extensions package now - available from NuGet or Microsoft Download.
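
Usage is roughly as follows (a sketch assuming the `EnumerableEx.MemoizeAll` extension method from the System.Interactive (Ix) package; newer versions of the package expose an equivalent `Memoize` overload instead):

using System.Linq; // EnumerableEx from the Ix package lives in this namespace

var cachedItems = items.MemoizeAll();

// First pass pulls items from the underlying source and caches them.
foreach (var item in cachedItems)
    Process(item);

// Second pass replays the cached items without re-running the query.
foreach (var item in cachedItems)
    ProcessMore(item);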

goric
  • Thanks, that's yet another reference to Rx I'm getting this week. Will need time to digest. – zzandy Sep 14 '12 at 15:26
4
// Note: these members need to live inside a static class for the extension
// method to compile, and require using System.Collections,
// System.Collections.Concurrent and System.Collections.Generic.
public static IEnumerable<T> SingleEnumeration<T>(this IEnumerable<T> source)
{
    return new SingleEnumerator<T>(source);
}

private class SingleEnumerator<T> : IEnumerable<T>
{
    private CacheEntry<T> cacheEntry;
    public SingleEnumerator(IEnumerable<T> sequence)
    {
        cacheEntry = new CacheEntry<T>(sequence.GetEnumerator());
    }

    public IEnumerator<T> GetEnumerator()
    {
        if (cacheEntry.FullyPopulated)
        {
            return cacheEntry.CachedValues.GetEnumerator();
        }
        else
        {
            return iterateSequence<T>(cacheEntry).GetEnumerator();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }
}

private static IEnumerable<T> iterateSequence<T>(CacheEntry<T> entry)
{
    using (var iterator = entry.CachedValues.GetEnumerator())
    {
        int i = 0;
        while (entry.ensureItemAt(i) && iterator.MoveNext())
        {
            yield return iterator.Current;
            i++;
        }
    }
}

private class CacheEntry<T>
{
    public bool FullyPopulated { get; private set; }
    public ConcurrentQueue<T> CachedValues { get; private set; }

    private readonly object key = new object(); // per-instance lock guarding the source enumerator
    private IEnumerator<T> sequence;

    public CacheEntry(IEnumerator<T> sequence)
    {
        this.sequence = sequence;
        CachedValues = new ConcurrentQueue<T>();
    }

    /// <summary>
    /// Ensure that the cache has an item at the provided index. If not, take an item from the
    /// input sequence and move it to the cache.
    /// 
    /// The method is thread safe.
    /// </summary>
    /// <returns>True if the cache already had enough items or 
    /// an item was moved to the cache, 
    /// false if there were no more items in the sequence.</returns>
    public bool ensureItemAt(int index)
    {
        //if the cache already has the item we don't need to lock to know we
        //can get it
        if (index < CachedValues.Count)
            return true;
        //if we're done there are no race conditions here either
        if (FullyPopulated)
            return false;

        lock (key)
        {
            //re-check the early-exit conditions in case they changed while we were
            //waiting on the lock.

            //we already have the cached item
            if (index < CachedValues.Count)
                return true;
            //we don't have the cached item and there are no uncached items
            if (FullyPopulated)
                return false;

            //we actually need to get the next item from the sequence.
            if (sequence.MoveNext())
            {
                CachedValues.Enqueue(sequence.Current);
                return true;
            }
            else
            {
                FullyPopulated = true;
                return false;
            }
        }
    }
}

So this has been edited (substantially) to support multithreaded access. Several threads can ask for items, and they will be cached on an item-by-item basis. It doesn't need to wait for the entire sequence to be iterated before returning cached values. Below is a sample program that demonstrates this:

private static IEnumerable<int> interestingIntGenerationMethod(int maxValue)
{
    for (int i = 0; i < maxValue; i++)
    {
        Thread.Sleep(1000);
        Console.WriteLine("actually generating value: {0}", i);
        yield return i;
    }
}

public static void Main(string[] args)
{
    IEnumerable<int> sequence = interestingIntGenerationMethod(10)
        .SingleEnumeration();

    int numThreads = 3;
    for (int i = 0; i < numThreads; i++)
    {
        int taskID = i;
        Task.Factory.StartNew(() =>
        {
            foreach (int value in sequence)
            {
                Console.WriteLine("Task: {0} Value:{1}",
                    taskID, value);
            }
        });
    }

    Console.WriteLine("Press any key to exit...");
    Console.ReadKey(true);
}

You really need to see it run to understand the power here. As soon as a single thread forces the next actual value to be generated, all of the remaining threads can immediately print that generated value, but they will all be waiting if there are no uncached values for them to print. (Obviously thread/thread-pool scheduling may result in one task taking longer to print its value than needed.)

Servy
  • 1
    The method requires that the first enumeration is full and complete before the result is cached. And ideally on the subsequent enumerations you could return an `IList<>` to take advantage of Linq optimizations. – Greg Sep 14 '12 at 15:25
  • @Greg As to your first point, that's intentional (just move the `cache.Add` to before the `foreach` to change that). I wouldn't want to cache half of the sequence, have another thread return a half-completed sequence, and then have the first thread finish off the cache entry later. As to the second point, yes, I could. It would involve re-factoring into two methods (you can't have regular returns and yield returns in the same method) and I wanted to keep it simpler. If you refactor the `else` into a method then the `if` can return a `List`. – Servy Sep 14 '12 at 15:31
  • @zzandy Completely re-written such that you no longer need to provide a key if you don't want. There is a new wrapper that will allow an enumerable that is iterated several times to use the cached value for all subsequent iterations. If you only ever want to use that then make the `CachedSequence` method `private`. – Servy Sep 14 '12 at 15:47
  • @Greg I've included the optimization in my edit, since I needed to re-work the solution anyway and it wasn't going to be simple anymore no matter what. – Servy Sep 14 '12 at 15:48
  • @Servy the initial question does not specify you need to deal with multithreading, so I think your idea is ok but could/should be simplified. If you want to deal with multithreading then your code is still not complete: if the enumeration is slow and you have several threads starting at the same time on the same variable, none of them would find the collection in the cache and all of them would therefore enumerate the source. – Wasp Sep 14 '12 at 16:16
  • @Wasp Take a look at it now. I spent some time making it work with multithreaded generation. – Servy Sep 14 '12 at 17:49
  • This isn't thread-safe: `iterateSequence` accesses `CachedValues` without the lock. Possible error: if CachedValues is being resized in the critical section, then while that resize is in progress its contents are undefined. However, the chance of you hitting that error condition is slim. Similarly, since `CachedValues.Count` is a property, not a field, you cannot safely access it outside of the critical section either (at least, not without knowing the implementation of `List` - in practice it probably is safe). – Eamon Nerbonne Nov 11 '13 at 12:39
  • @EamonNerbonne To the first point, the list is never being iterated using that method unless it is already fully populated, so the race condition doesn't exist. That method is unchecked specifically because it's only used when the entire sequence has been cached. To the second point, that's probably right. You can remove the optimization of accessing it outside the lock if it makes you uncomfortable, or check the source code of `List` to verify that it's safe for whatever version you're using. – Servy Nov 11 '13 at 14:58
  • It's been a while since you wrote the code, but take a look at the if-guard: if it's *not* fully populated, you call `iterateSequence` - that's the method that calls `ensureItemAt` to thread-safely ensure a particular item is loaded. But then, *if* the item is loaded, it goes directly to `CachedValues` - and that's not safe. – Eamon Nerbonne Nov 11 '13 at 19:36
  • @EamonNerbonne Ah, yes, I see what you mean now. – Servy Nov 11 '13 at 20:04
  • @EamonNerbonne Thanks; the recent edit should fix it, and with a minimal amount of delay for readers. – Servy Nov 11 '13 at 20:11
  • In any case, you'd use something like this (i.e. in parallel) when your processing takes long, and then it's probably worth a little overhead. – Eamon Nerbonne Nov 11 '13 at 20:42
  • @EamonNerbonne Agreed. I suppose the alternative would be to refactor the code to use a `ConcurrentQueue`. In fact, that might be worth doing. – Servy Nov 11 '13 at 20:43
  • I'd be curious about the perf difference, if you do :-) – Eamon Nerbonne Nov 11 '13 at 21:23
0

Thread-safe implementations of the Cached/SingleEnumeration operators have already been posted by Martin Liversage and Servy respectively, and the thread-safe Memoise operator from the System.Interactive package is also available. In case thread-safety is not a requirement, and paying the cost of thread synchronization is undesirable, there are answers offering unsynchronized ToCachedEnumerable implementations in this question. What all these implementations have in common is that they are based on custom types. My challenge was to write a similar non-synchronized operator in a single self-contained extension method (no strings attached). Here is my implementation:

// Requires: using System.Collections.Generic; using System.Diagnostics;
// (and a static class to host the extension method).
public static IEnumerable<T> MemoiseNotSynchronized<T>(this IEnumerable<T> source)
{
    // Argument validation omitted
    IEnumerator<T> enumerator = null;
    List<T> buffer = null;
    return Implementation();

    IEnumerable<T> Implementation()
    {
        if (buffer != null && enumerator == null)
        {
            // The source has been fully enumerated
            foreach (var item in buffer) yield return item;
            yield break;
        }

        enumerator ??= source.GetEnumerator();
        buffer ??= new();
        for (int i = 0; ; i = checked(i + 1))
        {
            if (i < buffer.Count)
            {
                yield return buffer[i];
            }
            else if (enumerator.MoveNext())
            {
                Debug.Assert(buffer.Count == i);
                var current = enumerator.Current;
                buffer.Add(current);
                yield return current;
            }
            else
            {
                enumerator.Dispose(); enumerator = null;
                yield break;
            }
        }
    }
}

Usage example:

IEnumerable<Point> points = GetPointsFromDB().MemoiseNotSynchronized();
// Enumerate the 'points' any number of times, on a single thread.
// The data will be fetched from the DB only once.
// The connection to the DB will open when 'points' is enumerated
// for the first time, partially or fully.
// The connection will stay open until 'points' is enumerated fully
// for the first time.

Testing the MemoiseNotSynchronized operator on Fiddle.
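
As a quick self-contained demonstration (with a hypothetical `FetchValues` stand-in for `GetPointsFromDB`), two passes enumerate the source only once:

// Requires: using System; using System.Collections.Generic; using System.Linq;
static IEnumerable<int> FetchValues() // hypothetical stand-in source
{
    for (int i = 0; i < 3; i++)
    {
        Console.WriteLine($"Fetching {i}"); // printed during the first pass only
        yield return i;
    }
}

var memo = FetchValues().MemoiseNotSynchronized();
Console.WriteLine(memo.Sum()); // enumerates the source, then prints the total (3)
Console.WriteLine(memo.Sum()); // replays the buffer; no more "Fetching" lines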

Theodor Zoulias