
I want to take an IEnumerable<T> and split it up into fixed-sized chunks.

I have this, but it seems inelegant due to all the list creation/copying:

private static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    List<T> partition = new List<T>(partitionSize);
    foreach (T item in items)
    {
        partition.Add(item);
        if (partition.Count == partitionSize)
        {
            yield return partition;
            partition = new List<T>(partitionSize);
        }
    }
    // Cope with items.Count % partitionSize != 0
    if (partition.Count > 0) yield return partition;
}

Is there something more idiomatic?

EDIT: Although this has been marked as a duplicate of Divide array into an array of subsequence array it is not - that question deals with splitting an array, whereas this is about IEnumerable<T>. In addition that question requires that the last subsequence is padded. The two questions are closely related but aren't the same.

Alastair Maw
  • Here is a [similar question with a couple of different solutions](http://stackoverflow.com/questions/3773403/linq-partition-list-into-lists-of-8-members) on Stack already. – Colin Pear Dec 04 '12 at 18:44
  • http://stackoverflow.com/questions/438188/split-a-collection-into-n-parts-with-linq – Dzmitry Martavoi Dec 04 '12 at 18:45
  • Answers are not allowed any more, but try this: http://stackoverflow.com/questions/3210824/divide-array-into-an-array-of-subsequence-array/29462069#29462069 – MBoros Apr 05 '15 at 20:50
  • Here is an elegant solution for a lazy partitioner using local functions in C# 7: https://gist.github.com/pmunin/533c10f0020b21230177cfb5a2d75bb4 – Philipp Munin May 14 '19 at 00:09

8 Answers

81

You could implement the Batch method mentioned in other answers on your own, like this:

    static class MyLinqExtensions
    {
        public static IEnumerable<IEnumerable<T>> Batch<T>(
            this IEnumerable<T> source, int batchSize)
        {
            // One shared enumerator drives every batch; each outer
            // MoveNext consumes the first element of the next batch.
            using (var enumerator = source.GetEnumerator())
                while (enumerator.MoveNext())
                    yield return YieldBatchElements(enumerator, batchSize - 1);
        }

        private static IEnumerable<T> YieldBatchElements<T>(
            IEnumerator<T> source, int batchSize)
        {
            // Yields the element the outer loop already advanced to,
            // followed by up to batchSize more from the shared enumerator.
            yield return source.Current;
            for (int i = 0; i < batchSize && source.MoveNext(); i++)
                yield return source.Current;
        }
    }

I've grabbed this code from http://blogs.msdn.com/b/pfxteam/archive/2012/11/16/plinq-and-int32-maxvalue.aspx.

UPDATE: Please note that this implementation lazily evaluates not only the batches but also the items inside each batch, which means it will only produce correct results when each batch is enumerated after all previous batches have been fully enumerated. For example:

public static void Main(string[] args)
{
    var xs = Enumerable.Range(1, 20);
    Print(xs.Batch(5).Skip(1)); // should skip first batch with 5 elements
}

public static void Print<T>(IEnumerable<IEnumerable<T>> batches)
{
    foreach (var batch in batches)
    {
        Console.WriteLine($"[{string.Join(", ", batch)}]");
    }
}

will output:

[2, 3, 4, 5, 6] // only the first element is skipped.
[7, 8, 9, 10, 11]
[12, 13, 14, 15, 16]
[17, 18, 19, 20]

So, if your use case assumes that batches are consumed strictly in order, the lazy solution above will work; otherwise, if you can't guarantee strictly sequential batch processing (e.g. when you want to process batches in parallel), you will probably need a solution that eagerly enumerates each batch's content, similar to the one in the question above or to MoreLINQ's Batch.
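
To make the failure mode concrete, here is a small illustration (mine, based on MBoros's example in the comments below):

// Count() advances the outer enumerator once per batch without ever
// enumerating the inner batch, so each "batch" consumes only a single
// source element: this prints 20, not the expected 2.
var batchCount = Enumerable.Range(1, 20).Batch(10).Count();
Console.WriteLine(batchCount);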

takemyoxygen
  • This looks very much like what I came up with. Upvoted. It looks like the code above requires that the `Enumerator` tolerates `MoveNext()` being called _twice_ in the end. When the source is exhausted (and the count isn't evenly divisible by `batchSize`), `MoveNext()` might be called once in the private helper method (where it returns `false` for the first time), and then once more in the public extension method. – Jeppe Stig Nielsen Dec 04 '12 at 20:46
  • Although people should read the comments in your link if they use this, to avoid surprises if they iterate over the inner sequences multiple times or with an .AsParallel() (which I'm not doing) – Alastair Maw Dec 05 '12 at 10:47
  • There is a bug in your code: depending on how it is used, certain elements are being pulled by source.Current. – J. Lennon Mar 02 '13 at 01:14
  • It's buggy because it's too lazy :) – arkhivania Jul 24 '13 at 11:22
  • The side effects of this implementation are a huge disadvantage (http://blogs.msdn.com/b/pfxteam/archive/2012/11/16/plinq-and-int32-maxvalue.aspx). For me it produced really unexpected, invalid output. Jeppe Stig Nielsen's implementation is the best! – SalientBrain Aug 26 '14 at 22:16
  • It is buggy. If you enumerate the second batch before the first, you get wrong results!!! – MBoros Mar 27 '15 at 12:10
  • If you write code like array.Batch(10).ToList(), you will get a list of lists, each containing one null element. – Seekeer Dec 09 '15 at 14:08
  • @MBoros, it's not buggy. It's a performance-reliability trade-off. If you just need to split `IEnumerable` (hypothetically infinite!) into batches, it does the job perfectly. If you need to enumerate the top level in random order or re-iterate, you may use other implementations (e.g. http://stackoverflow.com/a/438513/947012), but they will create extra objects in memory and are not so fast. – greatvovan Feb 15 '17 at 12:01
  • @greatvovan: this has a lot of bugs; it fails even the most basic case: take the numbers from 1 to 20, split them into batches of 10, and before seeing what's in your batches, just count them. I would have expected 2 batches, so why are there 20 then? :P – MBoros Feb 16 '17 at 18:36
  • btw if you are interested in a solution that is almost good, read this blog post: http://mboros.blogspot.de/2015/04/partitioning-enumearble-into-fixed-size.html?m=1 – MBoros Feb 16 '17 at 19:02
  • @MBoros Just read my comment again, I can't explain it better than repeating it word for word. I use this solution in one of my projects for partitioning items for an external service and it works perfectly. It is suitable for 90% of use cases, while you are trying to find cases where it does not work (yes, it is not universal). – greatvovan Feb 17 '17 at 22:06
  • @greatvovan sorry. I got hung up on the words 'not buggy'. It is buggy as hell. But of course under some special circumstances it can deliver a correct result. I would never use buggy code and count on the lucky case, but it is your personal choice. Anyway, you should not advertise this solution on SO, because it is NOT CORRECT! – MBoros Feb 18 '17 at 08:43
  • @MBoros What the luck are you talking about? Is your education computer science or esoteric? It is a deterministic algorithm and it works in 100% of cases on appropriate input and usage. Suppose the IEnumerable represents items from some external queue. How can you count the number of batches then? And what is more important – what for? The task was to split a sequence into chunks, that's it. You are now introducing additional requirements and complaining that it does not work. – greatvovan Feb 22 '17 at 11:24
  • @greatvovan Using NUnit, the following should pass but does not, so it is buggy. If you cannot understand this, then there is not a lot we can talk about. Assert.AreEqual(2, Enumerable.Range(1, 20).Batch(10).Count(), "Wrong number of Batches"); – MBoros Feb 25 '17 at 12:39
  • @MBoros Basically, you failed to write a correct unit test for this implementation. – greatvovan Feb 25 '17 at 18:15
  • @greatvovan so this test is not a requirement? – MBoros Feb 26 '17 at 10:44
  • @MBoros of course not, if you look at the question. – greatvovan Feb 27 '17 at 11:48
  • This implementation seems fine to me (as another user said, it's a trade-off for performance). However, it should throw an exception if somebody tries to enumerate a partition before the previous one is fully enumerated (e.g. when used in a parallel context). I'm thinking of a protection similar to the one that exists when you try to modify a collection that is currently being enumerated in a foreach loop (the exception prevents getting incorrect data). – tigrou Mar 02 '17 at 20:11
20

It feels like you want two iterator blocks ("yield return methods"). I wrote this extension method:

static class Extensions
{
  public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
  {
    return new PartitionHelper<T>(items, partitionSize);
  }

  private sealed class PartitionHelper<T> : IEnumerable<IEnumerable<T>>
  {
    readonly IEnumerable<T> items;
    readonly int partitionSize;
    bool hasMoreItems;

    internal PartitionHelper(IEnumerable<T> i, int ps)
    {
      items = i;
      partitionSize = ps;
    }

    public IEnumerator<IEnumerable<T>> GetEnumerator()
    {
      using (var enumerator = items.GetEnumerator())
      {
        hasMoreItems = enumerator.MoveNext();
        while (hasMoreItems)
          yield return GetNextBatch(enumerator).ToList();
      }
    }

    IEnumerable<T> GetNextBatch(IEnumerator<T> enumerator)
    {
      for (int i = 0; i < partitionSize; ++i)
      {
        yield return enumerator.Current;
        hasMoreItems = enumerator.MoveNext();
        if (!hasMoreItems)
          yield break;
      }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();      
    }
  }
}
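
A quick usage example (my addition, not part of the answer), showing the final short batch:

var batches = Enumerable.Range(1, 12).Partition(5);
foreach (var batch in batches)
    Console.WriteLine(string.Join(", ", batch));
// 1, 2, 3, 4, 5
// 6, 7, 8, 9, 10
// 11, 12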
Jeppe Stig Nielsen
  • Yes, that's exactly it, although takemyoxygen's answer is a little more concise so I've accepted that one, despite your proviso about multiple calls to MoveNext(). (I think most enumerators are quite happy with that, surely?) – Alastair Maw Dec 05 '12 at 10:41
  • This is really the best solution available!!! I've tried many of them! Reasons: no side effects (see http://blogs.msdn.com/b/pfxteam/archive/2012/11/16/plinq-and-int32-maxvalue.aspx), lazy/streaming, fast and memory efficient. – SalientBrain Aug 26 '14 at 22:12
  • Doing a ToList() on the returned item misses the point of the whole question... – MBoros Mar 27 '15 at 12:11
  • @SalientBrain As `ToList` is called on each batch, the memory efficiency is impacted somewhat if the batches are large. This is the best solution I've seen though. Unfortunately I don't think it's possible to have an _entirely_ streaming solution (i.e. one where both the batches and the items in each batch are streamed). – Steven Rands Nov 06 '15 at 11:44
  • I realise this is 4 years old, but is it appropriate to replace a good section of this implementation with LINQ's `Take` and `Skip` extension methods? (see the sketch after these comments) – Gusdor Feb 22 '16 at 16:31
  • I personally don't like this one much - the `.ToList()` is very inefficient as at that point the batch size (although known up front!) is not taken into account, so its internal array gets expanded out log n times, with all the copying, etc. If you want something that is more resilient than the accepted answer then the MoreLinq approach is surely both simpler *and* more performant. – Alastair Maw Apr 19 '20 at 21:52
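
For what it's worth, here is a rough sketch of the Skip/Take approach Gusdor asks about (my illustration, not from any answer). It is much shorter, but each Skip re-enumerates the source from the beginning, so it does quadratic work over a plain IEnumerable<T> and is only really reasonable for in-memory lists and arrays:

public static IEnumerable<IEnumerable<T>> PartitionBySkipTake<T>(
    this IEnumerable<T> items, int partitionSize)
{
    // Each iteration re-walks the source up to offset i, so the total
    // work is quadratic for sources that are not indexed collections.
    for (int i = 0; ; i += partitionSize)
    {
        var batch = items.Skip(i).Take(partitionSize).ToList();
        if (batch.Count == 0)
            yield break;
        yield return batch;
    }
}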
15

Maybe?

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    return items.Select((item, inx) => new { item, inx })
                .GroupBy(x => x.inx / partitionSize)
                .Select(g => g.Select(x => x.item));
}

There is an already implemented one too: morelinq's Batch.

L.B
  • I see that Batch basically does exactly what I do: http://code.google.com/p/morelinq/source/browse/MoreLinq/Batch.cs (only with an array internally instead of a list). – Alastair Maw Dec 04 '12 at 18:57
  • -1 as this one pulls everything into memory before returning any results and even then uses more memory by grouping things in a hashtable. – Alastair Maw Nov 22 '13 at 20:11
8

Craziest solution (with Reactive Extensions):

public static IEnumerable<IList<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    return items
            .ToObservable()        // Converting the sequence to an observable sequence
            .Buffer(partitionSize) // Splitting it into "partitions" of the specified size
            .ToEnumerable();       // Converting it back to an ordinary sequence
}

I know that I changed the signature, but we all know that each chunk will be some fixed-size collection anyway.

BTW, if you use an iterator block, do not forget to split your implementation into two methods so that arguments are validated eagerly!
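
A minimal sketch of that split (my illustration; the method names are made up, and the iterator body just reuses the eager buffering from the question): the public method runs its checks immediately, while only the private iterator block is deferred:

public static IEnumerable<IList<T>> PartitionChecked<T>(this IEnumerable<T> items, int partitionSize)
{
    // Validation runs eagerly here: an iterator block would defer even
    // the argument checks until the first MoveNext() call.
    if (items == null) throw new ArgumentNullException("items");
    if (partitionSize <= 0) throw new ArgumentOutOfRangeException("partitionSize");
    return PartitionIterator(items, partitionSize);
}

private static IEnumerable<IList<T>> PartitionIterator<T>(IEnumerable<T> items, int partitionSize)
{
    var partition = new List<T>(partitionSize);
    foreach (var item in items)
    {
        partition.Add(item);
        if (partition.Count == partitionSize)
        {
            yield return partition;
            partition = new List<T>(partitionSize);
        }
    }
    if (partition.Count > 0) yield return partition; // final partial chunk
}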

Sergey Teplyakov
  • There's no real need to have a fixed-size collection as a chunk. Sure, it makes life easier to implement, but it's hardly a requirement. – Alastair Maw Jan 08 '16 at 14:42
  • @AlastairMaw There is a real case where we require a fixed-size collection. I have a query with more than 1000 values inside an "IN (..)" statement, which causes this error: "ORA-01795: maximum number of expressions in a list is 1000". So I needed to partition the statement into chunks of at most 1000 items, to be later merged with "OR" conditions. – ozanmut Dec 05 '17 at 12:10
5

For an elegant solution, you can also have a look at MoreLinq.Batch.

It batches the source sequence into buckets of the given size.

Example:

int[] ints = new int[] {1,2,3,4,5,6};
var batches = ints.Batch(2); // batches -> [0]: 1,2; [1]: 3,4; [2]: 5,6
Tilak
  • As noted in the other answer that mentions this, http://code.google.com/p/morelinq/source/browse/MoreLinq/Batch.cs does exactly what I do. OK. – Alastair Maw Dec 04 '12 at 18:57
  • Yes, you are right. Your code is elegant and does the same thing. I had not checked the other links. I just use this library, and thus specified it here as an alternative. – Tilak Dec 04 '12 at 19:08
  • I've accepted takemyoxygen's answer as I think it's more elegant for not having the intermediate list copying. – Alastair Maw Dec 05 '12 at 10:38
1

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, 
                                                       int partitionSize)
{
    int i = 0;
    return items.GroupBy(x => i++ / partitionSize).ToArray();
}
nawfal
  • That will evaluate all the items before returning a result, pulling everything into memory, which somewhat defeats the purpose of using IEnumerables. If I wanted to do that I'd just pass in a List to start with and be done. – Alastair Maw Dec 04 '12 at 18:56
  • Is the `.Select(x => x)` actually necessary? – Jeppe Stig Nielsen Dec 04 '12 at 19:00
  • You need to evaluate the expression before you return; otherwise it yields wrong results. – nawfal Dec 06 '12 at 19:51
0

How about the partitioner classes in the System.Collections.Concurrent namespace?
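
For reference, a minimal sketch of what that looks like (my illustration, not from the answer). Note that Partitioner.Create over an IEnumerable<T> splits the source into a requested number of partitions that pull chunks from a shared source on demand, so it is aimed at parallel consumption rather than at producing chunks of an exact size:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class PartitionerSketch
{
    static void Main()
    {
        IEnumerable<int> items = Enumerable.Range(1, 20);

        // Four partitions share one underlying enumerator and grab
        // chunks from it on demand; the chunk sizes are an
        // implementation detail, not something the caller controls.
        IList<IEnumerator<int>> partitions = Partitioner.Create(items).GetPartitions(4);

        Parallel.ForEach(partitions, partition =>
        {
            using (partition)
                while (partition.MoveNext())
                    Console.WriteLine("{0}: {1}", Task.CurrentId, partition.Current);
        });
    }
}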

Christoffer
  • Maybe I'm being dumb, but the example given here http://msdn.microsoft.com/en-us/library/dd381768.aspx seems truly enormous for such a simple task. How would this actually work in a way that's more elegant than what I already have? – Alastair Maw Dec 05 '12 at 10:35
0

You can do this using an overload of Enumerable.GroupBy and taking advantage of integer division.

return items.Select((element, index) => new { Element = element, Index = index })
    .GroupBy(obj => obj.Index / partitionSize, (_, partition) => partition);
Adam Maras
  • It's good, but you have to write `(_, partition) => partition.Select(x => x.Element)` instead of `(_, partition) => partition` – Roman Pekar Dec 05 '12 at 06:10
  • This is rather inefficient - it has to pull the whole `IEnumerable` into memory (assuming it's of finite length to begin with), and will probably use a rather wasteful hashtable to do the grouping. – Alastair Maw Nov 22 '13 at 20:09
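
For completeness, here is the method with Roman Pekar's fix from the comments applied (Alastair Maw's caveat about the grouping pulling everything into memory still applies):

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    // The result selector projects the anonymous wrappers back to the
    // underlying elements, so callers see IEnumerable<IEnumerable<T>>.
    return items.Select((element, index) => new { Element = element, Index = index })
        .GroupBy(obj => obj.Index / partitionSize, (_, partition) => partition.Select(x => x.Element));
}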