7

I was looking at the source code of Batch method and I have seen this:

// Select is necessary so bucket contents are streamed too
yield return resultSelector(bucket.Select(x => x));

There is a comment which I didn't quite understand. I have tested this method without using Select and it worked well. But it seems there is something I'm missing.I can't think of any example where this would be necessary, So what's the actual purpose of using Select(x => x) here ?

Here is the full source code for reference:

private static IEnumerable<TResult> BatchImpl<TSource, TResult>(
        this IEnumerable<TSource> source,
        int size,
        Func<IEnumerable<TSource>, TResult> resultSelector)
    {
        TSource[] bucket = null;
        var count = 0;

        foreach (var item in source)
        {
            if (bucket == null)
                bucket = new TSource[size];

            bucket[count++] = item;

            // The bucket is fully buffered before it's yielded
            if (count != size)
                continue;

            // Select is necessary so bucket contents are streamed too
            yield return resultSelector(bucket.Select(x => x));

            bucket = null;
            count = 0;
        }

        // Return the last bucket with all remaining elements
        if (bucket != null && count > 0)
            yield return resultSelector(bucket.Take(count));
    }
BartoszKP
  • 34,786
  • 15
  • 102
  • 130
Selman Genç
  • 100,147
  • 13
  • 119
  • 184
  • Delayed execution. The iterator isn't accessed until it's enumerated over. – Brad Christie Dec 11 '14 at 13:42
  • 7
    That was *probably* my code to start with. Hmm. Usually I would do this to prevent a result selector from casting the source to an array and then messing with it, but in this case that wouldn't actually be harmful. – Jon Skeet Dec 11 '14 at 13:42
  • 1
    @BradChristie It will be delayed also without the `Select` here (whatever you mean). – BartoszKP Dec 11 '14 at 13:45
  • It looks like this is so iteration over the bucket contents is deferred, e.g. the enumerable is only enumerated during iteration, though the code which sets up the buckets needs to iterate the enumerable in the first place to bucket them. What does `resultSelector` look like? Does it wrap the enumerable in another enumerator? (Oh my bad didn't see it in the signature, this is what happens when you go on SO on your mobile) – Charleh Dec 11 '14 at 13:47
  • @Charleh What is deferred in LINQ is *execution*, not iteration. In this case, the array has already been created with its contents calculated, so there is nothing to defer any more. – BartoszKP Dec 11 '14 at 13:53
  • @JonSkeet so can I assume this is redundant? or are there cases where this migth be necessary or useful? – Selman Genç Dec 11 '14 at 13:54
  • 1
    @JonSkeet Perhaps return type consistency? Even though casting to an array wouldn't be harmful, it would work only partially, as for the last bucket, the cast would fail. – BartoszKP Dec 11 '14 at 13:56
  • 2
    @Selman22 I think what Jon Skeet is saying is that it is used to convert the mutable array to an immutable `IEnumerable` to ensure that `resultSelector()` (which is passed in from outside) cannot muck with it. The reason it's unnecessary here is that `bucket` is immediately set to `null` right after that `yield return`. – JLRishe Dec 11 '14 at 13:56
  • @Selman22: I can't think of any way of abusing it right now, but that's not to say that there isn't one... – Jon Skeet Dec 11 '14 at 13:58
  • @JonSkeet A result selector with side effects is abuse in and of itself, in my opinion. At least this forces it to be written correctly, which can only be a good thing, particularly if the same function is reused elsewhere... – nmclean Dec 11 '14 at 16:20

1 Answers1

5

To sum up what's in the comments, theoretically this is redundant. Deferred execution is irrelevant in this case. At the point of yield full execution has already been made: contents of bucket are already calculated and there is nothing to defer.

There is also no problem caused by the iterator block behaviour - each time we're back in this implementation the bucket is being reset and recreated (bucket = null immediately after yield). Even if someone would cast the result to the array type and modify it, we don't care.

An advantage of this approach seems to be only elegance: there is a type consistency between all calls to resultSelector. Without the "redundant" Select, the actual type would have been TSource[] most of the time, and IEnumerable<TSource> for the trailing elements that did not fill the whole bucket.

However, one can imagine the following scenario:

  1. someone using this function notices that the actual type is an array
  2. because of some urge need to improve performance, they cast the received batch to TSource[] (e.g. they can now skip elements more efficiently, as Skip is not optimized for arrays)
  3. they use the method without any problems, because it happens that Count() % size == 0 in their case

Until, later, it happens that one additional elements pops in, causing the last yield to be executed. And now the cast to TSource[] will fail.

So, depending on the number of elements and size the method would behave inconsistently with regard to its result type (passed to the given callback). One can imagine other elaborate scenarios where this inconsistency can cause trouble, like some ORM that, depending on the actual type, serializes objects into different tables. In this context pieces of data would end up in different tables.

These scenarios are of course all based on some other mistake being made, and do not prove that without the Select the implementation is wrong. It is however more friendly with the Select, in a sense, that it reduces the number of such unfortunate scenarios to a minimum.

Community
  • 1
  • 1
BartoszKP
  • 34,786
  • 15
  • 102
  • 130