Java grouping in stream

Question

Java 8 streams allow us to collect elements while grouping by an arbitrary constraint. For example:

Map<Type, List<MyThing>> grouped = stream
  .collect(groupingBy(myThing -> myThing.type()));

However, this has the drawback that the stream must be read through completely, so there is no chance of lazy evaluation of future operations on grouped.
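To make the eagerness concrete, here is a small sketch that counts how many elements have been pulled from the source by the time `collect` returns. The class name `EagerGrouping` and the use of strings grouped by length are illustrative assumptions, not part of the original code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class EagerGrouping {
    // Returns how many source elements had been read when collect() returned.
    static int elementsReadByCollect(List<String> words) {
        AtomicInteger read = new AtomicInteger();
        Map<Integer, List<String>> grouped = words.stream()
                .peek(w -> read.incrementAndGet())          // counts every element pulled
                .collect(Collectors.groupingBy(String::length));
        // grouped has not been touched yet, but the whole source is already consumed.
        return read.get();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "cc", "ddd");
        System.out.println(elementsReadByCollect(words) == words.size());
    }
}
```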

Is there a way to do a grouping operation to get something like Stream<Tuple<Type, Stream<MyThing>>>? Is it even conceptually possible to group lazily in any language without evaluating the whole data set?

No, this doesn't really make sense. Give up. – Louis Wasserman Aug 29 '16 at 23:22

score 4 · Accepted Answer · edited Aug 29 '16 at 23:07

The concept of lazy grouping doesn't really make sense. Grouping, by definition, means selecting groups in advance to avoid the overhead of searching through all the elements for each key. "Lazy grouping" would look like this:

List<MyThing> get(Type key) {
    // Scans the whole source on every call; nothing is precomputed.
    return source.stream()
            .filter(myThing -> myThing.type().equals(key))
            .collect(toList());
}

If you prefer to defer iteration to when you know you need it, or if you want to avoid the memory overhead of caching a grouping map, this is perfectly fine. But you can't optimize the selection process without iterating ahead of time.
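The trade-off can be sketched side by side. This is a minimal illustration, assuming a hypothetical `MyThing` class with a `type()` accessor and a hypothetical `Type` enum, since neither is defined in the question; the source is passed in as a parameter rather than held in a field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingComparison {
    // Hypothetical stand-ins for the question's Type and MyThing.
    enum Type { A, B }

    static final class MyThing {
        private final Type type;
        private final String name;
        MyThing(Type type, String name) { this.type = type; this.name = name; }
        Type type() { return type; }
        String name() { return name; }
    }

    // Deferred lookup: no up-front work, but each call iterates the whole source.
    static List<MyThing> get(List<MyThing> source, Type key) {
        return source.stream()
                .filter(myThing -> myThing.type().equals(key))
                .collect(Collectors.toList());
    }

    // Eager grouping: one full pass up front, then each lookup is a map access.
    static Map<Type, List<MyThing>> group(List<MyThing> source) {
        return source.stream().collect(Collectors.groupingBy(MyThing::type));
    }

    public static void main(String[] args) {
        List<MyThing> source = Arrays.asList(
                new MyThing(Type.A, "a1"),
                new MyThing(Type.B, "b1"),
                new MyThing(Type.A, "a2"));
        System.out.println(get(source, Type.A).size());
        System.out.println(group(source).get(Type.B).size());
    }
}
```

With k distinct keys, calling `get` once per key costs k passes over the source, while `group` costs one pass plus the memory for the map.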

edited Aug 29 '16 at 23:07 by kag0

answered Aug 29 '16 at 23:01 by shmosel

My thought was that the overhead of a grouping map wouldn't be so bad as long as the number of grouping keys isn't prohibitively large, and the grouping values are also streams. – kag0 Aug 29 '16 at 23:14

@kag0 Well, sure, you can create a map of keys to streams, but either the streams will point to the source collection, which leaves you with virtually no performance enhancement, or the streams will point to a grouped collection (e.g. `groupingBy(MyThing::type, collectingAndThen(toList(), List::stream))`), but that leaves you with no memory reduction. – shmosel Aug 29 '16 at 23:20

@kag0 On a side note, there's not much memory overhead in grouping to begin with, since it's only creating a shallow copy of your data. It'll likely cost slightly over `n * 4` bytes, where `n` is the size of your source collection. – shmosel Aug 29 '16 at 23:25
If the streams point to `Spliterator`s from queues or something, wouldn't that give the performance enhancement of being able to parallelize the grouping operation and operations on already grouped elements? Then the grouped and processed elements could be released from memory before later elements are even loaded. – kag0 Aug 30 '16 at 01:49
@kag0: whatever you actually want to say with these “`Spliterator`s from queues or something”, the answer is *no*. As shmosel already hinted at, the data structure that `groupingBy` produces by default (currently a `HashMap` from key to `ArrayList`) is close to the optimum, even for subsequent streaming operations on the map values, including parallel processing. You are wasting effort trying to improve something that doesn’t need improvement. – Holger Aug 30 '16 at 12:38
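The "map of keys to streams" variant mentioned in the comments can be sketched as follows. This is an illustration only: it groups strings by length instead of the question's `MyThing`, and the class name `GroupedStreams` is made up. Each stream in the map is backed by a list that `groupingBy` has already built in full, so nothing about the grouping itself is lazy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GroupedStreams {
    // Groups words by length; every value is a stream over an already-built list.
    static Map<Integer, Stream<String>> groupToStreams(List<String> words) {
        return words.stream().collect(
                Collectors.groupingBy(String::length,
                        Collectors.collectingAndThen(Collectors.toList(), List::stream)));
    }

    public static void main(String[] args) {
        Map<Integer, Stream<String>> byLength =
                groupToStreams(Arrays.asList("a", "bb", "cc", "ddd"));
        // Each stream can be consumed only once.
        System.out.println(byLength.get(2).count());
    }
}
```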

score 1 · Answer 2 · answered Aug 29 '16 at 22:33

A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream.

Taken from the doc at:

https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html

So I think there is no way to split it without consuming it and creating new streams.
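A short sketch of the single-use rule quoted above. The documentation says an implementation may throw `IllegalStateException` when it detects reuse; the sketch assumes the standard JDK stream implementation, which does detect it. The class and method names are illustrative.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SingleUseStream {
    // Returns true if building a second pipeline on a consumed stream is rejected.
    static boolean secondUseFails() {
        Stream<String> stream = Stream.of("a", "b", "c");
        stream.collect(Collectors.toList());      // terminal operation consumes the stream
        try {
            stream.filter(s -> !s.isEmpty());     // attempt to "fork" a second pipeline
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondUseFails());
    }
}
```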

answered Aug 29 '16 at 22:33 by mzl

This is another way to state my question. `Stream.collect` is a terminal operation, and the only way to use `Collectors.groupingBy`. I would like to know if there is a non-terminal way to group in streams. – kag0 Aug 29 '16 at 22:56

score 0 · Answer 3 · answered Aug 29 '16 at 22:26

I do not think that this would make sense, since reading from one partition (Tuple<Type, Stream<MyThing>>) of a lazy stream Stream<Tuple<Type, Stream<MyThing>>> could consume an arbitrarily large amount of memory in the other partitions.

E.g. consider the lazy stream of positive integers in natural order and group them by their smallest prime factor. Then reading from the last received element of the stream of partitions would produce an ever increasing number of integers in the streams received before.
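The buffering effect in this example can be sketched with a hand-rolled lazy grouper. Everything here is hypothetical (the class `LazyGroupBuffers`, its `next(key)` method, and the per-group queues); it is not a Java Stream API feature. It only shows that reading from one group forces elements of the other groups to be parked in memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.stream.IntStream;

public class LazyGroupBuffers {
    private final Iterator<Integer> source;   // assumed to be infinite in this sketch
    private final Map<Integer, Deque<Integer>> buffers = new HashMap<>();

    LazyGroupBuffers(Iterator<Integer> source) { this.source = source; }

    static int smallestPrimeFactor(int n) {
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return d;
        }
        return n; // no divisor found, so n is prime
    }

    // Pulls from the source until an element of the requested group appears,
    // parking every element that belongs to another group in that group's buffer.
    int next(int key) {
        Deque<Integer> own = buffers.computeIfAbsent(key, k -> new ArrayDeque<>());
        while (own.isEmpty()) {
            int n = source.next();
            buffers.computeIfAbsent(smallestPrimeFactor(n), k -> new ArrayDeque<>()).add(n);
        }
        return own.poll();
    }

    // Total number of elements currently held in memory across all groups.
    int buffered() {
        return buffers.values().stream().mapToInt(Deque::size).sum();
    }

    public static void main(String[] args) {
        LazyGroupBuffers groups =
                new LazyGroupBuffers(IntStream.iterate(2, i -> i + 1).iterator());
        for (int i = 0; i < 3; i++) {
            System.out.println(groups.next(7) + " read, " + groups.buffered() + " buffered");
        }
    }
}
```

Reading only from the group with smallest prime factor 7 makes the buffers of the unread groups grow with every call, which is the unbounded memory use described above.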

answered Aug 29 '16 at 22:26 by aventurin

This is true; however, I think that would not be unexpected. Attempting that operation using `collect(groupingBy())` would produce the same result (never complete due to memory deprivation). Performing any operation on an infinite stream will either require an infinite amount of memory or time. – kag0 Aug 29 '16 at 23:03

score 0 · Answer 4 · answered Aug 29 '16 at 23:24

Is it even conceptually possible to group lazily in any language without evaluating the whole data set?

No, you cannot group an entire data set correctly without checking the entire data set or having a guarantee of an exploitable pattern in the data. For example, I can group the first 10,000 integers into even-odd lazily, but I can't lazily group even-odd for a random set of 10,000 integers.
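The "exploitable pattern" case can be sketched like this, assuming "the first 10,000 integers" means 0 through 9,999 (the answer does not say where the range starts). Because the source is known in advance, each group is generated directly and lazily, with no scan of the other group. The class and method names are illustrative.

```java
import java.util.stream.IntStream;

public class PredictableGroups {
    // Lazily yields the first `count` even numbers of the sequence 0, 1, 2, ...
    static IntStream evens(int count) {
        return IntStream.iterate(0, i -> i + 2).limit(count);
    }

    // Lazily yields the first `count` odd numbers of the sequence 0, 1, 2, ...
    static IntStream odds(int count) {
        return IntStream.iterate(1, i -> i + 2).limit(count);
    }

    public static void main(String[] args) {
        // 0..9,999 splits into 5,000 evens and 5,000 odds.
        System.out.println(evens(5000).max().getAsInt());
        System.out.println(odds(5000).max().getAsInt());
    }
}
```

No such shortcut exists for 10,000 random integers: the only way to know which group an element belongs to is to look at it.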

As far as grouping in a non-terminal fashion... it's not something that seems like a good idea. Conceptually, a grouping function on a stream should return multiple streams, as if it were branching the different streams, and Java 8 does not support that.

If you really want to use native Stream methods to group non-terminally, you could abuse the sorted method. Give it a comparator that orders the groups relative to each other but treats all elements within a group as equal, and you'll end up with group1, group2, group3, etc. This won't give you lazy evaluation, but it is grouping.
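A minimal sketch of the `sorted` trick, grouping strings by length as a stand-in for the question's types (the class name `SortedGrouping` is made up). The comparator looks only at the group key, so same-length words compare as equal and end up in one contiguous run. `sorted` is still a stateful operation that has to see every element before it emits the first one.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortedGrouping {
    // Orders words by length only; words of equal length keep their encounter
    // order, because sorted() is stable for ordered streams.
    static List<String> groupByLength(List<String> words) {
        return words.stream()
                .sorted(Comparator.comparingInt(String::length))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(groupByLength(Arrays.asList("bb", "a", "cc", "d")));
    }
}
```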

Not sure I follow the even odd example. Can you explain why that is? — kag0, Aug 29 '16 at 23:34
It's possible to lazily group the first 10,000 integers into evens and odds because you can predict values with 100% accuracy. You could lazily group any sequential set of integers based on whether or not they are even or odd, since even-odd is a predictable pattern. But, if I just have 10,000 random integers, I lose the ability to predict data, so I have to check all of them if I want to correctly group. — Jeutnarg, Aug 30 '16 at 15:28
