
I have a question about the efficiency of Skip() and Take() when used with IEnumerable<>.

I am returning all my data lists as IEnumerable<>, and I use 'yield return' to avoid allocating large amounts of memory to pass back the data. This works very efficiently.

However, later in my process I want to batch this data, taking blocks of, say, 20 entries from my list at a time. I thought to myself: ah! This fits an enumerator perfectly.

I discovered the very useful Skip() and Take() extension methods for IEnumerable<>; however, I'm now realising that this causes my loop to re-iterate from the beginning each time.

What is the best way of paging data from an IEnumerable<>? Am I better off using Reset() and MoveNext() on the enumerator instead of Skip() and Take()?

I've done some googling but can't find an answer.

Can anyone help?

I really love the LINQ functionality on IEnumerable<>, but I really have to take efficiency into consideration.
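To make the problem concrete, here is a minimal sketch of the pattern described above (the data source, sizes, and loop are hypothetical stand-ins, not the asker's actual code). Because `Skip`/`Take` operate on a fresh enumeration, each page restarts the iterator block from the beginning:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PagingDemo
{
    static int enumerations = 0;

    // Simulates a 'yield return' data source; counts how often
    // the iterator block is started over.
    static IEnumerable<int> GetData()
    {
        enumerations++;
        for (int i = 0; i < 100; i++)
            yield return i;
    }

    static void Main()
    {
        const int batchSize = 20;

        // Each Skip/Take pass re-enumerates the source from item 0.
        for (int page = 0; page < 5; page++)
        {
            List<int> batch = GetData()
                .Skip(page * batchSize)
                .Take(batchSize)
                .ToList();
            Console.WriteLine($"page {page}: {batch.Count} items");
        }

        // The source was walked once per page: 5 enumerations for 5 pages.
        Console.WriteLine($"source enumerated {enumerations} times");
    }
}
```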

John Saunders
user3328317

2 Answers


You can write a Batch method that transforms a sequence of items into a sequence of batches of a given size. It iterates the source sequence only once, and it limits the memory footprint to a single batch at a time:

public static IEnumerable<IEnumerable<T>> Batch<T>(
    this IEnumerable<T> source, int batchSize)
{
    List<T> buffer = new List<T>(batchSize);

    foreach (T item in source)
    {
        buffer.Add(item);

        if (buffer.Count >= batchSize)
        {
            // A full batch is ready; hand it off and start a new buffer.
            yield return buffer;
            buffer = new List<T>(batchSize);
        }
    }

    // Yield any leftover items as a final, partial batch.
    if (buffer.Count > 0)
    {
        yield return buffer;
    }
}
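A usage sketch (the `BatchDemo` wrapper class and the sample data are my additions, not part of the answer). Extension methods must live in a static class, and the source here is enumerated exactly once, with a short final batch for the leftovers:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BatchDemo
{
    // The same Batch method as above, wrapped in the static class
    // that extension methods require.
    public static IEnumerable<IEnumerable<T>> Batch<T>(
        this IEnumerable<T> source, int batchSize)
    {
        List<T> buffer = new List<T>(batchSize);
        foreach (T item in source)
        {
            buffer.Add(item);
            if (buffer.Count >= batchSize)
            {
                yield return buffer;
                buffer = new List<T>(batchSize);
            }
        }
        if (buffer.Count > 0)
            yield return buffer;
    }

    static void Main()
    {
        // 50 items in batches of 20 -> batch sizes 20, 20, 10.
        foreach (IEnumerable<int> batch in Enumerable.Range(0, 50).Batch(20))
            Console.WriteLine($"batch of {batch.Count()}");
    }
}
```

(On .NET 6 and later, the built-in `Enumerable.Chunk` does essentially the same thing.)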
Servy
  • In the context of paging which seems to be what this question is about: Doing `source.Batch(pageSize).Skip(pageNo).Take(1)` is almost the same as `source.Skip(pageNo*pageSize).Take(pageSize).ToList()` except the former does more unnecessary allocations of the skipped pages. – Martin Liversage Feb 19 '14 at 15:38
  • @MartinLiversage Where does the OP state that he's only looking to take out one page at a time? The whole point here is that you wouldn't just pull out one page, then go back, batch the items again, and pull out another page. You'd have something like `foreach(var batch in data.Batch(batchSize))processBatch(batch);` – Servy Feb 19 '14 at 15:41
  • _I discovered the very useful Skip() and Take() methods on the IEnumerable interface however I'm now realising that this causes my Loop to re-iterate from the beginning each time._ indicates to me that he is doing `source.Skip(n).Take(m)` and has discovered that each time he does that the iteration starts over from the first element. My point is that either you have to start over or you have to store the items "skipped" so that they can be reused next time `Skip` is called on the source. There is no magic solution that neither costs CPU nor memory. – Martin Liversage Feb 19 '14 at 15:46
  • @MartinLiversage But there is. If he is creating a loop that looks something like: `for(int i = 0; i < numBatches; i++) processBatch(data.Skip(batchSize*i).Take(batchSize));`, which is what it sounds like he's doing, then he could replace that with my code, iterate the source sequence exactly once, never have more than one batch in memory, and not any significant unnecessary CPU work. This is the appropriate solution to that problem. – Servy Feb 19 '14 at 15:52

There will always be a tradeoff between memory and CPU. Currently, you get the items for a page by moving forward to the start of the page with Skip, and the items are recomputed by the iterator block on every page request.

However, you can avoid the recomputation by caching the items computed so far, at the cost of some memory. You state that you decided to use an iterator block to avoid using too much memory, but perhaps a "smart" solution that caches only the necessary items could be useful?

In the answers to the Stack Overflow question Is there an IEnumerable implementation that only iterates over it's source (e.g. LINQ) once you will find some solutions that compute and store only enough elements to be able to move to your page. E.g. if your page size is 10 and you want page 5, only the first 60 items are computed and stored. A subsequent request for page 3 will use the already computed items, while a request for page 10 will compute and cache enough additional items to get the data for that page.
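A minimal, single-threaded sketch of that "cache as you go" idea (the class name and all details here are illustrative; the answers to the linked question are more robust, including thread safety and disposal):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Wraps a lazy source and records items as they are first produced,
// so later enumerations replay the cache instead of restarting the source.
// Not thread-safe, and the source enumerator is never disposed: a sketch only.
public sealed class CachedEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> _source;
    private readonly List<T> _cache = new List<T>();
    private bool _exhausted;

    public CachedEnumerable(IEnumerable<T> source)
    {
        _source = source.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        int index = 0;
        while (true)
        {
            if (index < _cache.Count)
            {
                // Replay an item an earlier enumeration already produced.
                yield return _cache[index++];
            }
            else if (!_exhausted && _source.MoveNext())
            {
                // Pull one more item from the source and remember it.
                _cache.Add(_source.Current);
                yield return _cache[index++];
            }
            else
            {
                _exhausted = true;
                yield break;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

With this, `cached.Skip(40).Take(10)` materializes the first 50 items once; a later `cached.Skip(20).Take(10)` reads purely from the cache, and only requests beyond item 50 advance the underlying source.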

If you want to perform paging without starting from the first element and also without unnecessarily storing unused items, you need some way to restart the iteration at a particular page without having to iterate all the previous elements. IEnumerable<T> and IEnumerator<T> do not provide enough functionality to do that.

Martin Liversage
  • As he stated in the question, the purpose of using an iterator block here is to reduce the memory footprint, because storing the entire data set in memory all at once is too much. Caching the entire data set in memory thus defeats that purpose. – Servy Feb 19 '14 at 14:33
  • @Servy: The answers to the linked question (one of which you have provided yourself) do not cache the entire sequence but only the minimum number of items generated to move to a particular page. This solves the problem stated in the question: _I'm now realising that this causes my Loop to re-iterate from the beginning each time_. However, the obvious cost is that items are now cached. As long as you use an iterator block you cannot avoid this tradeoff. – Martin Liversage Feb 19 '14 at 15:35
  • It doesn't cache each item until you reach it, but the point remains that the memory footprint of the method is O(n), where "n" is the number of items iterated so far. That means that if you iterate the whole thing, you've put the entire contents of the sequence in memory at once. If you end up iterating all or most of the sequence in the majority of cases, then you've really gained nothing over just calling `ToList`. The advantage comes when you are 1) iterating it from multiple threads/sources concurrently, or 2) frequently iterating only a small portion of it. – Servy Feb 19 '14 at 15:37