
I would like to group a very large sequence lazily using code like the following:

// native F# version
let groups =
    Seq.initInfinite id
        |> Seq.groupBy (fun i -> i % 10)
for (i, group) in groups |> Seq.take 5 do
    printfn "%A: %A" i (group |> Seq.take 5)

Expected output is:

0: seq [0; 10; 20; 30; ...]
1: seq [1; 11; 21; 31; ...]
2: seq [2; 12; 22; 32; ...]
3: seq [3; 13; 23; 33; ...]
4: seq [4; 14; 24; 34; ...]

However, in practice, this program loops infinitely, printing nothing. Is it possible to accomplish this in F#?

I'd be willing to use LINQ instead of native functions, but both GroupBy and ToLookup produce the same behavior (even though LINQ's GroupBy is supposed to be lazy):

// LINQ version
let groups =
    Enumerable.GroupBy(
        Seq.initInfinite id,
        (fun i -> i % 10))
for group in groups |> Seq.take 5 do
    printfn "%A" (group |> Seq.take 5)

Perhaps I'm doing something unintentionally that causes eager evaluation?

Brian Berns
  • No, it's not you, it's the methods' implementation. They are not truly lazy: they will not start evaluation until you need results, but all the results will be generated at once when that happens. But that's because a scenario like yours is super uncommon. – MarcinJuraszek May 09 '15 at 02:03 [a short demonstration follows these comments]
  • OK, well, that's disappointing. Thanks. – Brian Berns May 09 '15 at 02:12
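
A short demonstration of the point in that comment (the logging source and the 0..20 bound are made up purely for illustration): LINQ's GroupBy defers all work until its result is first enumerated, and at that moment it consumes its entire source in one go.

// demonstration sketch: LINQ's GroupBy is deferred, but not streaming
open System.Linq

let logged =
    seq {
        for i in 0 .. 20 do
            printfn "pulled %d" i
            yield i
    }

let deferredGroups = Enumerable.GroupBy(logged, (fun i -> i % 10))
printfn "nothing pulled yet"

let firstGroup = deferredGroups |> Seq.head   // all 21 "pulled" lines print here, in one burst
printfn "first key: %d" firstGroup.Key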

2 Answers


There are two things to say:

First of all, how do you know how many groups there will be in an infinite sequence? In other words, how many items do you need to materialize to get your 5 groups from above? How many would you need to materialize if you asked for 11 groups? Conceptually, it is not even easy to explain informally what should happen when you group lazily.
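
As a concrete illustration, the built-in (eager) Seq.groupBy is only usable once the input has been bounded, so the usual workaround is to truncate the source before grouping. A minimal sketch with an arbitrary bound of 1000 elements (it naturally only sees the groups and elements that occur within that bound):

// workaround sketch: bound the infinite source first, then group eagerly
let boundedGroups =
    Seq.initInfinite id
        |> Seq.take 1000                     // arbitrary bound; groupBy must consume all of it
        |> Seq.groupBy (fun i -> i % 10)

for (key, group) in boundedGroups |> Seq.take 5 do
    printfn "%A: %A" key (group |> Seq.take 5)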

Secondly, the Rx version of GroupBy is lazy and is probably as close as you can get to what you want: http://rxwiki.wikidot.com/101samples#toc24 This version of GroupBy works because it reacts to each element and fires the appropriate group as it arrives: you get an event when a new element is consumed, together with the information about which group it falls into, as opposed to getting a list of groups.
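
For reference, here is a rough sketch of what that can look like from F#. It assumes the Rx.NET (System.Reactive) package is referenced; Observable.Range is a finite stand-in for the infinite source, and a key filter is used instead of Take on the outer stream, because completing the outer stream early would unsubscribe from the source.

// Rx sketch: a group is pushed to the subscriber as soon as its first element arrives
open System
open System.Reactive.Linq

let subscription =
    Observable.Range(0, 100)                      // finite stand-in for the infinite source
        .GroupBy(fun i -> i % 10)                 // one inner observable per key, created on first sight
        .Where(fun group -> group.Key < 5)        // keep the first 5 keys (Take here would cancel the source)
        .Subscribe(fun group ->
            // print the first 5 elements of each selected group
            group.Take(5).Subscribe(fun x -> printfn "group %d: %d" group.Key x) |> ignore)

Because Rx is push-based, the output arrives interleaved per element (group 0: 0, group 1: 1, ..., group 0: 10, ...) rather than as five materialized sequences.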

Daniel Fabian
  • Why would I need to know how many groups there will be "up front", since the groups themselves are returned in a lazy sequence? If the 5th group isn't generated until the trillionth element in the sequence, then obviously "take 5" groups is going to be very expensive. Nonetheless, "take n" at the group level should return the first N groups generated by the sequence, just as "take n" at the element level returns the first N elements generated by the sequence. In what way are group-level sequences so different from element-level sequences? – Brian Berns May 09 '15 at 07:59
  • (Also, practically speaking, I actually do know exactly how many groups there will be in this case, since the grouping function is the mod operator. But even so, the groupBy function is too eager because there is no way for me to convey this information during the call. Maybe there's a reasonable solution to the problem given a finite, fully-specified set of groups up front? Signature would be something like "groupByLazy fixedArrayOfGroups groupingFunction inputSequence".) – Brian Berns May 09 '15 at 08:08
  • Well the problem is this: Say you enumerate the 2nd group. In order to get to the first element of that group, you need to enumerate some amount of elements, right? Now what would you do with them? Cache? Ignore them? – Daniel Fabian May 09 '15 at 08:14
  • In principle what you need to do, is something like a Seq.unzip – Daniel Fabian May 09 '15 at 08:16
  • @DanielFabian: Those elements would be cached. Isn't that how groupBy works today? I should be able to enumerate the second group as far as I want, then go "back" and enumerate the first group without losing any elements. The amount of memory consumed by the implementation would go up and down as various groups are enumerated, but I still don't see why there's a need to enumerate the entire underlying sequence in order to build the implementation. – Brian Berns May 09 '15 at 15:02
  • @brianberns I was more trying to show what the problems behind a truly lazy group by are. Exactly the one that you only ever know how many groups you'll end up with once you've enumerated all of the elements. So enumerating all the groups, for instance, basically requires full iteration, etc. The behaviour would not be easy to predict and would be very data dependent. In particular, you might explode the memory upon accessing a certain (later) group, etc. Therefore, I think the trade-off was made in favour of the current implementation. – Daniel Fabian May 09 '15 at 15:18
  • @DanielFabian: Yes, enumerating all the groups would require full iteration of the underlying sequence, but that's exactly what the existing groupBy does today, so I don't see it as a new problem. Yes, behavior will be data dependent, memory usage could explode, etc., but again that's no different from what we have today with groupBy. In other words, the proposed behavior is no worse than the existing groupBy, and potentially much better in some cases. I don't see this as a "trade-off" at all, since there's no down-side to what I'm suggesting. Maybe I'll see if I can implement it myself... – Brian Berns May 09 '15 at 15:57 [a sketch of this caching approach follows the comment thread]
  • I agree, your behaviour would not be worse, but it would be less deterministic. So the trade-off is between suboptimal but deterministic and "better" but harder to reason about. I personally see merit in both possible implementations. You'd get better behaviour for some inputs, but sometimes worse. And FSharp.Core went with worse on average but easy to predict. – Daniel Fabian May 09 '15 at 16:02
  • Yes, I suppose. I think they should change the signature of the existing groupBy function though, since seq implies laziness. It's really seq<_>[], not seq<seq<_>>. – Brian Berns May 09 '15 at 16:04
  • Agreed, they did that with Async.Parallel – Daniel Fabian May 09 '15 at 16:15
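
For what it's worth, here is a rough sketch of the caching approach discussed in this thread. It is not how FSharp.Core's Seq.groupBy is implemented, and the names (lazyGroupBy, pullOne, and so on) are made up for illustration; it is for single-threaded use only and, as noted above, its memory use grows with every cached element belonging to a group you are not currently reading.

// lazy groupBy sketch: elements are cached per key and pulled from a single
// shared enumerator only on demand
open System.Collections.Generic

let lazyGroupBy (keyOf : 'T -> 'Key) (source : seq<'T>) : seq<'Key * seq<'T>> =
    let enumerator = source.GetEnumerator()
    let caches = Dictionary<'Key, ResizeArray<'T>>()
    let keysInOrder = ResizeArray<'Key>()

    // pull one element from the shared enumerator, filing it under its key;
    // returns false once the source is exhausted
    let pullOne () =
        if enumerator.MoveNext() then
            let item = enumerator.Current
            let key = keyOf item
            match caches.TryGetValue key with
            | true, cache -> cache.Add item
            | _ ->
                caches.[key] <- ResizeArray [ item ]
                keysInOrder.Add key
            true
        else false

    // the values of one group: serve cached items first, then keep pulling from
    // the source until another item with this key shows up (or the source ends)
    let groupSeq key =
        seq {
            let index = ref 0
            let more = ref true
            while more.Value do
                let cache = caches.[key]
                if index.Value < cache.Count then
                    yield cache.[index.Value]
                    index.Value <- index.Value + 1
                else
                    more.Value <- pullOne ()
        }

    // the groups themselves: a group is emitted as soon as its key is first seen
    seq {
        let index = ref 0
        let more = ref true
        while more.Value do
            if index.Value < keysInOrder.Count then
                let key = keysInOrder.[index.Value]
                index.Value <- index.Value + 1
                yield key, groupSeq key
            else
                more.Value <- pullOne ()
    }

// usage: the question's original example now terminates
let lazyGroups = Seq.initInfinite id |> lazyGroupBy (fun i -> i % 10)
for (key, group) in lazyGroups |> Seq.take 5 do
    printfn "%A: %A" key (group |> Seq.take 5)

The outer sequence yields a group as soon as its key is first seen, and each inner sequence can be read as far as desired, pulling (and caching) more of the source on demand.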

My Hopac library for F# has an implementation of so-called choice streams (presentation), which are both lazy and concurrent/asynchronous, and which also provide a groupBy operation.