
I have a data set represented by a Java 8 stream:

Stream<T> stream = ...;

I can see how to filter it to get a random subset - for example

Random r = new Random();
PrimitiveIterator.OfInt coin = r.ints(0, 2).iterator();   
Stream<T> heads = stream.filter((x) -> (coin.nextInt() == 0));

I can also see how I could reduce this stream to get, for example, two lists representing two random halves of the data set, and then turn those back into streams. But, is there a direct way to generate two streams from the initial one? Something like

(heads, tails) = stream.[some kind of split based on filter]

Thanks for any insight.

Stuart Marks
user1148758
  • Mark's answer is much more helpful than Louis's, but I must say Louis's is more relevant to the original question. The question is focused on the possibility of converting a `Stream` to multiple `Stream`s **without intermediate conversion**, though I think people who reach this question are actually looking for a way to achieve that regardless of such a constraint, which is Mark's answer. This may be due to the fact that **the question in the title is not the same as the one in the description**. – devildelta Jan 17 '20 at 07:56

11 Answers


A collector can be used for this.

  • For two categories, use the Collectors.partitioningBy() factory.

This will create a Map<Boolean, List<T>>, and put items in one or the other list based on a Predicate.

Note: Since the stream needs to be consumed in full, this can't work on infinite streams. And because the stream is consumed anyway, this method simply puts the elements into Lists instead of making a new stream-with-memory. You can always stream those lists if you require streams as output.

Also, there is no need for the iterator, not even in the heads-only example you provided.

  • Binary splitting looks like this:
Random r = new Random();

Map<Boolean, List<String>> groups = stream
    .collect(Collectors.partitioningBy(x -> r.nextBoolean()));

System.out.println(groups.get(false).size());
System.out.println(groups.get(true).size());
  • For more categories, use a Collectors.groupingBy() factory.
Map<Object, List<String>> groups = stream
    .collect(Collectors.groupingBy(x -> r.nextInt(3)));
System.out.println(groups.get(0).size());
System.out.println(groups.get(1).size());
System.out.println(groups.get(2).size());
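Since the partitions returned by these collectors are plain Lists, opening a fresh stream on each one is a one-liner. A minimal, self-contained sketch (with hypothetical sample data):

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionThenStream {
    public static void main(String[] args) {
        Random r = new Random();
        // Partition a small sample stream into two lists...
        Map<Boolean, List<String>> groups = Stream.of("a", "b", "c", "d")
                .collect(Collectors.partitioningBy(x -> r.nextBoolean()));

        // ...then open new streams on the lists when needed.
        Stream<String> heads = groups.get(true).stream();
        Stream<String> tails = groups.get(false).stream();

        System.out.println(heads.count() + tails.count()); // always 4
    }
}
```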

If the stream is not a Stream, but one of the primitive streams like IntStream, then this .collect(Collector) method is not available. You'll have to do it the manual way without a collector factory. Its implementation looks like this:

[Example 2.0 since 2020-04-16]

    IntStream    intStream = IntStream.iterate(0, i -> i + 1).limit(100000).parallel();
    IntPredicate predicate = ignored -> r.nextBoolean();

    Map<Boolean, List<Integer>> groups = intStream.collect(
            () -> Map.of(false, new ArrayList<>(100000),
                         true , new ArrayList<>(100000)),
            (map, value) -> map.get(predicate.test(value)).add(value),
            (map1, map2) -> {
                map1.get(false).addAll(map2.get(false));
                map1.get(true ).addAll(map2.get(true ));
            });

In this example I initialize the ArrayLists with the full size of the initial collection (if this is known at all). This prevents resize events even in the worst-case scenario, but can potentially gobble up 2NT space (N = initial number of elements, T = number of threads). To trade-off space for speed, you can leave it out or use your best educated guess, like the expected highest number of elements in one partition (typically just over N/2 for a balanced split).
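If the boxing overhead is acceptable, a much shorter route for primitive streams (also pointed out in the comments) is to box first, which brings the regular collector factories back. A sketch:

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BoxedPartitioning {
    public static void main(String[] args) {
        Random r = new Random();
        Map<Boolean, List<Integer>> groups = IntStream.range(0, 100)
                .boxed() // IntStream -> Stream<Integer>, so Collectors work again
                .collect(Collectors.partitioningBy(i -> r.nextBoolean()));

        System.out.println(groups.get(false).size() + groups.get(true).size()); // 100
    }
}
```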

I hope I don't offend anyone by using a Java 9 method. For the Java 8 version, look at the edit history.

Mark Jeronimus
  • Beautiful. However, the last solution for IntStream won't be thread-safe in case of a parallelized stream. The solution is much simpler than you think it is ... `stream.boxed().collect(...);`! It will do as advertised: convert the primitive `IntStream` to the boxed `Stream` version. – YoYo May 08 '15 at 03:57
  • This should be the accepted answer as it directly solves the OP question. – ejel Aug 18 '15 at 02:47
  • I wish Stack Overflow would allow the community to override the selected answer if a better one is found. – GuiSim Aug 15 '16 at 15:38
  • I'm not sure this answers the question. The question requests splitting a stream into streams - not Lists. – AlikElzin-kilaka Nov 28 '18 at 17:31
  • As I said in the beginning, you get two new lists. It's easy to extrapolate and get two streams from these lists. – Mark Jeronimus Nov 29 '18 at 09:47
  • In Python, you get `itertools.tee()`, which, much like a T junction in a pipeline, splits the stream/sequence. The magic is that any items consumed by one stream and not the other are cached in a buffer. This means that, if one stream consumes everything before the other, you might as well dump the whole thing in a container. But if they advance more or less in parallel, then little state is held in memory. That said, I suspect threading implications are preventing this from being implemented in Java. – Julian Dec 11 '18 at 10:42
  • The accumulator function is unnecessarily verbose. Instead of `(map, x) -> { boolean partition = p.test(x); List list = map.get(partition); list.add(x); }` you can simply use `(map, x) -> map.get(p.test(x)).add(x)`. Further, I don’t see any reason why the `collect` operation shouldn’t be thread-safe. It works exactly as it is supposed to work and very closely to how `Collectors.partitioningBy(p)` would work. But I’d use an `IntPredicate` instead of `Predicate` when not using `boxed()`, to avoid boxing twice. – Holger Apr 16 '20 at 14:58
  • Righty, thanks for your input @Holger. I double-checked with thread-monitoring and indeed every map created remains thread-bound. I improved the example. – Mark Jeronimus Apr 16 '20 at 15:37

I stumbled across this question myself, and I feel that a forked stream has some valid use cases. I wrote the code below as a Consumer so that it does not do anything by itself, but you could apply it to functions and anything else you might come across.

class PredicateSplitterConsumer<T> implements Consumer<T>
{
  private Predicate<T> predicate;
  private Consumer<T>  positiveConsumer;
  private Consumer<T>  negativeConsumer;

  public PredicateSplitterConsumer(Predicate<T> predicate, Consumer<T> positive, Consumer<T> negative)
  {
    this.predicate = predicate;
    this.positiveConsumer = positive;
    this.negativeConsumer = negative;
  }

  @Override
  public void accept(T t)
  {
    if (predicate.test(t))
    {
      positiveConsumer.accept(t);
    }
    else
    {
      negativeConsumer.accept(t);
    }
  }
}

Now your code implementation could be something like this:

personsArray.forEach(
        new PredicateSplitterConsumer<>(
            person -> person.getDateOfBirth().isPresent(),
            person -> System.out.println(person.getName()),
            person -> System.out.println(person.getName() + " does not have Date of birth")));
Ludger

Unfortunately, what you ask for is directly frowned upon in the JavaDoc of Stream:

A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream.

You can work around this using peek or other methods should you truly desire that type of behaviour. In this case, instead of trying to feed two pipelines from the same original Stream source with a forking filter, you would duplicate your stream and filter each of the duplicates appropriately.

However, you may wish to reconsider if a Stream is the appropriate structure for your use case.
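For sources that can be re-streamed (a collection, a range, a file), the duplicate-and-filter workaround can be sketched with a Supplier; note that this re-reads the source once per resulting stream:

```java
import java.util.function.Supplier;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class DuplicateAndFilter {
    public static void main(String[] args) {
        // Each get() re-creates the stream from the same underlying source.
        Supplier<Stream<Integer>> source = () -> IntStream.range(0, 10).boxed();

        Stream<Integer> evens = source.get().filter(n -> n % 2 == 0);
        Stream<Integer> odds  = source.get().filter(n -> n % 2 != 0);

        System.out.println(evens.count()); // 5
        System.out.println(odds.count());  // 5
    }
}
```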

Trevor Freeman
  • The javadoc wording does not exclude partitioning into several streams as long as a single stream item only goes in _one_ of these – Thorbjørn Ravn Andersen Aug 19 '15 at 10:44
  • @ThorbjørnRavnAndersen I am not sure duplicating a stream item is the major impediment to a forked stream. The main issue is that the forking operation is essentially a terminal operation, so when you decide to fork you are basically creating a collection of some sort. E.g. I can write a method `List forkStream(Stream s)` but my resulting streams will at least partially be backed by collections and not directly by the underlying stream, as opposed to say `filter` which is not a terminal stream operation. – Trevor Freeman Aug 19 '15 at 23:39
  • This is one of the reasons I feel Java streams are a bit half-assed compared to https://github.com/ReactiveX/RxJava/wiki because the point of stream is to apply operations on a potentially infinite set of elements and real world operations frequently require splitting, duplicating and merging streams. – Usman Ismail Aug 14 '17 at 13:10
  • @TrevorFreeman : why should it be impossible? See e.g. https://stackoverflow.com/a/66526781/1587329 – serv-inc Sep 02 '22 at 09:48
  • @serv-inc I don't believe I said anything was impossible, but the answer you link to is about collecting the stream (and splitting collectors). If we simply want to collect / terminate the stream then the solution is trivial, the hard (to maybe impossible, depending on your definition) part is actually splitting a stream without terminating / collecting it. Imagine an infinite stream of data that just keeps arriving, and you wish to have two actual stream objects partitioned from this single stream... this is the difficult problem to solve. – Trevor Freeman Jan 05 '23 at 21:32
  • @TrevorFreeman: difficult maybe, but solved in Scala with iterator.duplicate, right? See https://docs.scala-lang.org/overviews/collections/iterators.html . But you were right, teeing is a terminal op in Java-speak, which might be undesirable – serv-inc Jan 06 '23 at 13:23

Since Java 12 you can get two Streams out of one with teeing. Counting heads and tails in 100 coin flips:

Random r = new Random();
PrimitiveIterator.OfInt coin = r.ints(0, 2).iterator();
List<Long> list = Stream.iterate(0, i -> coin.nextInt())
    .limit(100).collect(teeing(              // requires static imports of
        filtering(i -> i == 1, counting()),  // Collectors.teeing, filtering, counting
        filtering(i -> i == 0, counting()),
        (heads, tails) -> List.of(heads, tails)));
System.out.println("heads:" + list.get(0) + " tails:" + list.get(1));

prints e.g.: heads:51 tails:49
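teeing is not limited to counting; it can also hand back the two partitions as lists, from which new streams can be opened. A sketch, assuming Java 12+:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

import static java.util.stream.Collectors.*;

public class TeeingSplit {
    public static void main(String[] args) {
        // Split 0..9 into evens and odds in a single pass.
        Map.Entry<List<Integer>, List<Integer>> halves = IntStream.range(0, 10).boxed()
                .collect(teeing(
                        filtering(i -> i % 2 == 0, toList()),  // evens
                        filtering(i -> i % 2 != 0, toList()),  // odds
                        Map::entry));

        System.out.println(halves.getKey());   // [0, 2, 4, 6, 8]
        System.out.println(halves.getValue()); // [1, 3, 5, 7, 9]
    }
}
```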

Kaplan
  • https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/stream/Collectors.html#teeing(java.util.stream.Collector,java.util.stream.Collector,java.util.function.BiFunction) – Matthew Aug 25 '21 at 08:12
  • This seems to be the correct answer for splitting a stream. – cody.tv.weber Sep 26 '22 at 15:45

Not exactly. You can't get two Streams out of one; this doesn't make sense -- how would you iterate over one without needing to generate the other at the same time? A stream can only be operated over once.

However, if you want to dump them into a list or something, you could do

List<Integer> heads = new ArrayList<>();
List<Integer> tails = new ArrayList<>();
stream.forEach(x -> (x == 0 ? heads : tails).add(x));
Louis Wasserman
  • Why doesn't it make sense? Since a stream is a pipeline there's no reason it couldn't create two producers of the original stream, I could see this being handled by a collector that provides two streams. – Brett Ryan Feb 12 '14 at 07:24
  • Not thread safe. Bad advice trying to add directly to a collection; that is why we have `stream.collect(...)`, with predefined thread-safe `Collectors` that work well even on non-thread-safe Collections (with no synchronized lock contention). Best answer by @MarkJeronimus. – YoYo May 08 '15 at 04:04
  • @JoD It is thread-safe if heads and tails are thread-safe. Additionally, assuming the use of non-parallel streams, only the order is not guaranteed, so they are thread-safe. It's up to the programmer to fix concurrency issues, so this answer is perfectly suitable if the collections are thread safe. – Nicolas Feb 15 '16 at 23:55
  • @Nixon it is not suitable in the presence of a better solution, which we have here. Having such code can lead to bad precedent, causing others to use it in a wrong way. Even if no parallel streams are used, it is only one step away. Good coding practices require us not to maintain state during stream operations. Next thing we do is coding in a framework like Apache Spark, and the same practices would really lead to unexpected results. It was a creative solution, I give that, one I might have written myself not so long ago. – YoYo Feb 16 '16 at 00:25
  • Also, Louis, this is good documentation. I would prefer this not to be deleted. This could also help someone in a different way, or someone might find my observation to be wrong, which wouldn't be the first time. – YoYo Feb 16 '16 at 00:33
  • @JoD It is not a better solution, it's factually more inefficient. That line of thinking ultimately ends up with the conclusion that all Collections should be thread safe by default to prevent unintended consequences, which is simply wrong. – Nicolas Feb 29 '16 at 20:50
  • @Nixon thread-safe usage of a collection does not imply that collections should be thread-safe. Rather than following a locking mechanism by using synchronized methods, you avoid maintaining state across threads or introducing [side effects](https://en.wikipedia.org/wiki/Side_effect_%28computer_science%29). This subject has been elaborately discussed [here](http://programmers.stackexchange.com/q/148108/210201). – YoYo Feb 29 '16 at 21:24
  • @LouisWasserman you should be able to edit your answer with the update to indicate improvements you wish to make later. – Usman Ismail Aug 14 '17 at 13:06

This is against the general mechanism of Stream. Say you could split Stream S0 into Sa and Sb as you wanted. Performing any terminal operation, say count(), on Sa would necessarily "consume" all elements in S0. Therefore Sb would lose its data source.

Previously, Stream had a tee() method, I think, which duplicated a stream into two. It has since been removed.

Stream has a peek() method though; you might be able to use it to achieve your requirements.
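For instance, peek() can shunt the non-matching elements into a side collection while the main pipeline keeps the matching ones. This is a sketch only: it relies on a stateful side effect, which the Stream documentation discourages, and it is not safe for parallel streams without a concurrent collection:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PeekSideChannel {
    public static void main(String[] args) {
        List<Integer> odds = new ArrayList<>();  // side channel for the "other" half

        List<Integer> evens = IntStream.range(0, 10).boxed()
                .peek(n -> { if (n % 2 != 0) odds.add(n); }) // capture non-matching elements
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());

        System.out.println(evens); // [0, 2, 4, 6, 8]
        System.out.println(odds);  // [1, 3, 5, 7, 9]
    }
}
```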

ZhongYu
  • `peek` is exactly what used to be `tee`. – Louis Wasserman Nov 12 '13 at 21:44
  • With Java 12 `Collectors` got a new method `teeing()` which, however, is somewhat *unmanageable*. An example is [here](https://stackoverflow.com/questions/19940319/can-you-split-a-stream-into-two-streams/66526781#66526781). – Kaplan Mar 10 '21 at 09:12

Not exactly, but you may be able to accomplish what you need by invoking Collectors.groupingBy(). You create a new Collection, and can then instantiate streams on that new collection.
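A minimal sketch of that approach, grouping by a key and then streaming each bucket:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class GroupThenStream {
    public static void main(String[] args) {
        // Group into buckets first (terminal operation)...
        Map<Integer, List<Integer>> buckets = IntStream.range(0, 10).boxed()
                .collect(Collectors.groupingBy(n -> n % 2));

        // ...then open a new stream on each bucket.
        Stream<Integer> evens = buckets.get(0).stream();
        Stream<Integer> odds  = buckets.get(1).stream();

        System.out.println(evens.count() + odds.count()); // 10
    }
}
```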

aepurniet

This was the least bad answer I could come up with.

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.commons.lang3.tuple.ImmutablePair;
import org.apache.commons.lang3.tuple.Pair;

public class Test {

    public static <T, L, R> Pair<L, R> splitStream(Stream<T> inputStream, Predicate<T> predicate,
            Function<Stream<T>, L> trueStreamProcessor, Function<Stream<T>, R> falseStreamProcessor) {

        Map<Boolean, List<T>> partitioned = inputStream.collect(Collectors.partitioningBy(predicate));
        L trueResult = trueStreamProcessor.apply(partitioned.get(Boolean.TRUE).stream());
        R falseResult = falseStreamProcessor.apply(partitioned.get(Boolean.FALSE).stream());

        return new ImmutablePair<L, R>(trueResult, falseResult);
    }

    public static void main(String[] args) {

        Stream<Integer> stream = Stream.iterate(0, n -> n + 1).limit(10);

        Pair<List<Integer>, String> results = splitStream(stream,
                n -> n > 5,
                s -> s.filter(n -> n % 2 == 0).collect(Collectors.toList()),
                s -> s.map(n -> n.toString()).collect(Collectors.joining("|")));

        System.out.println(results);
    }

}

This takes a stream of integers and splits them at 5. For those greater than 5 it filters only even numbers and puts them in a list. For the rest it joins them with |.

outputs:

 ([6, 8],0|1|2|3|4|5)

It's not ideal, as it collects everything into intermediate collections, breaking the stream (and has too many arguments!).

Ian Jones

I stumbled across this question while looking for a way to filter certain elements out of a stream and log them as errors. So I did not really need to split the stream so much as attach a premature terminating action to a predicate with unobtrusive syntax. This is what I came up with:

public class MyProcess {
    /* Return a Predicate that performs a bail-out action on non-matching items. */
    private static <T> Predicate<T> withAltAction(Predicate<T> pred, Consumer<T> altAction) {
        return x -> {
            if (pred.test(x)) {
                return true;
            }
            altAction.accept(x);
            return false;
        };
    }

    /* Example usage in non-trivial pipeline */
    public void processItems(Stream<Item> stream) {
        stream.filter(Objects::nonNull)
              .peek(this::logItem)
              .map(Item::getSubItems)
              .filter(withAltAction(SubItem::isValid,
                                    i -> logError(i, "Invalid")))
              .peek(this::logSubItem)
              .filter(withAltAction(i -> i.size() > 10,
                                    i -> logError(i, "Too large")))
              .map(SubItem::toDisplayItem)
              .forEach(this::display);
    }
}

A shorter version that uses Lombok:

import java.util.function.Consumer;
import java.util.function.Predicate;

import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;

/**
 * Forks a Stream using a Predicate into positive and negative outcomes.
 */
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PROTECTED)
public class StreamForkerUtil<T> implements Consumer<T> {
    Predicate<T> predicate;
    Consumer<T> positiveConsumer;
    Consumer<T> negativeConsumer;

    @Override
    public void accept(T t) {
        (predicate.test(t) ? positiveConsumer : negativeConsumer).accept(t);
    }
}
OneCricketeer

How about:

Supplier<Stream<Integer>> randomIntsStreamSupplier =
    () -> (new Random()).ints(0, 2).boxed();

Stream<Integer> tails =
    randomIntsStreamSupplier.get().filter(x->x.equals(0));
Stream<Integer> heads =
    randomIntsStreamSupplier.get().filter(x->x.equals(1));
Matthew
  • Since the supplier is called twice, you will get two different random collections. I think the OP's intent is to split the odds from the evens in the **same** generated sequence – usr-local-ΕΨΗΕΛΩΝ Apr 11 '17 at 17:33