
This is sort of a follow-up question to this question. The proposed solution in the answers is to use the groupBy operator. That is generally fine, but as mentioned in its docs, it is not recommended with a large number of distinct keys, let's say tens of thousands.

data
  .groupBy(Data::getPointID)
  .flatMap(sameIdFlux -> sameIdFlux
    .concatMap(processor::process)
  )
  .subscribe();

Each group naturally has an unbounded number of elements, which may arrive at any time. I also need to limit the number of groups that are processed concurrently. As I understand it, if I used the code above, I would either reach the implicit limit of open groups, so new ones would not be opened (or processed), or I would eventually run out of memory, since even long-inactive groups would never be closed (think deleted entities) and would therefore keep consuming memory overhead for nothing.

Is there some operator or pattern I could use to achieve the same behavior without running into the problems mentioned above? I originally tried to close each group after some reasonable Duration, but that opens me up to race conditions: when a group closes and the same ID arrives right away, the two would be processed in parallel, which is not desired.

EDIT: I investigated a bit more and tried more approaches, and currently my biggest problem seems to be how to properly manage backpressure / correctly limit the maximum concurrency without limiting the number of groups itself. Data generation is usually steady, but it can sometimes produce large spikes that I need to throttle accordingly. For illustration, here is a sketch of what I mean (MAX_CONCURRENCY is just a placeholder): bounding the outer flatMap's concurrency limits how many groups are subscribed at once, but with groupBy that is exactly what stops new groups from being opened once the limit is hit.
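
data
  .groupBy(Data::getPointID)
  // the concurrency argument caps how many inner publishers (groups) are
  // subscribed at the same time; with groupBy this also caps the number of
  // open groups, which is the problem described above
  .flatMap(sameIdFlux -> sameIdFlux
    .concatMap(processor::process), MAX_CONCURRENCY
  )
  .subscribe();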

1 Answer


I am new to the world of Spring WebFlux and Project Reactor, so I am not aware of an out-of-the-box pattern that solves your problem. However, you can create your own pattern to limit the number of groups created by the groupBy operator.

In the example below I use int partition = i % numberOfPartitions;, inspired by this blog post about Apache Flink, to decide which of the numberOfPartitions partitions each element is assigned to.

    public Flux<GroupedFlux<Integer, Data>> createFluxUsingGroupBy(List<String> dataList, int numberOfPartitions, int maxCount) {
        return Flux
                .fromStream(IntStream.range(0, maxCount)
                        .mapToObj(i -> {
                            int randomPosition = ThreadLocalRandom.current().nextInt(0, dataList.size());
                            // round-robin assignment: the partition depends only on the element index,
                            // so the number of groups is bounded by numberOfPartitions
                            int partition = i % numberOfPartitions;
                            return new Data(i, dataList.get(randomPosition), partition);
                        })
                )
                .delayElements(Duration.ofMillis(10))
                .log()
                .groupBy(Data::getPartition);
    }
........

    @lombok.Data
    @AllArgsConstructor
    @NoArgsConstructor
    public class Data {
        private Integer key;
        private String value;
        private Integer partition;
    }

When I execute it with numberOfPartitions = 3 I get partitions from 0 to 2 (3 partitions), regardless of the key being used.

    @Test
    void testFluxUsingGroupBy() {
        int numberOfPartitions = 3;
        int maxCount = 100;
        Flux<GroupedFlux<Integer, Data>> dataGroupedFlux = fluxAndMonoTransformations.createFluxUsingGroupBy(expect, numberOfPartitions, maxCount);
        StepVerifier.create(dataGroupedFlux)
                .expectNextCount(numberOfPartitions)
                .verifyComplete();
    }

Here is the log:

10:43:02.168 [Test worker] INFO reactor.Flux.ConcatMap.1 - onSubscribe(FluxConcatMap.ConcatMapImmediate)
10:43:02.179 [Test worker] INFO reactor.Flux.ConcatMap.1 - request(256)
10:43:02.291 [parallel-1] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=0, value=Spring, partition=0))
10:43:02.362 [parallel-1] INFO reactor.Flux.ConcatMap.1 - request(1)
10:43:02.375 [parallel-2] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=1, value=Scala, partition=1))
10:43:02.377 [parallel-2] INFO reactor.Flux.ConcatMap.1 - request(1)
10:43:02.388 [parallel-3] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=2, value=reactive programming, partition=2))
10:43:02.389 [parallel-3] INFO reactor.Flux.ConcatMap.1 - request(1)
10:43:02.400 [parallel-4] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=3, value=java with lambda, partition=0))
10:43:02.411 [parallel-1] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=4, value=Spring, partition=1))
10:43:02.422 [parallel-2] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=5, value=java 8, partition=2))
10:43:02.433 [parallel-3] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=6, value=java with lambda, partition=0))
10:43:02.444 [parallel-4] INFO reactor.Flux.ConcatMap.1 - onNext(Data(key=7, value=java with lambda, partition=1))
...

To enhance this solution for the case where no private Integer key; is available on the Data object, I can derive the partition from a hash of the value. I also added another parameter, the parallelism. It is mainly useful for a restore operation: if you save the values to storage with a parallelism of X and later read the same values back with a different parallelism != X, you can still keep the values in the same group. So I used int partition = (getDifferentHashCode(value) * parallelism) % numberOfPartitions;, also inspired by the blog post I mentioned (in the code below I wrap it in Math.floorMod so the partition stays non-negative even if the hash overflows to a negative int). I prefer this approach.

    public Flux<GroupedFlux<Integer, Data>> createFluxUsingHashGroupBy(List<String> dataList, int numberOfPartitions, int parallelism, int maxCount) {
        return Flux
                .fromStream(IntStream.range(0, maxCount)
                        .mapToObj(i -> {
                            int randomPosition = ThreadLocalRandom.current().nextInt(0, dataList.size());
                            String value = dataList.get(randomPosition);
                            // hash-based assignment: the same value always lands in the same partition;
                            // floorMod keeps the partition non-negative even if the hash overflows
                            int partition = Math.floorMod(getDifferentHashCode(value) * parallelism, numberOfPartitions);
                            return new Data(i, value, partition);
                        })
                )
                .delayElements(Duration.ofMillis(10))
                .log()
                .groupBy(Data::getPartition);
    }

    // simple polynomial hash over the characters (seeded with 7), independent of String.hashCode()
    public int getDifferentHashCode(String value) {
        int hash = 7;
        for (int i = 0; i < value.length(); i++) {
            hash = hash * 31 + value.charAt(i);
        }
        return hash;
    }

Unit test:

    @Test
    void testFluxUsingHashGroupBy() {
        int numberOfPartitions = 3;
        int parallelism = 2;
        int maxCount = 100;
        Flux<GroupedFlux<Integer, Data>> dataGroupedFlux = fluxAndMonoTransformations.createFluxUsingHashGroupBy(expect, numberOfPartitions, parallelism, maxCount);
        StepVerifier.create(dataGroupedFlux)
                .expectNextCount(numberOfPartitions)
                .verifyComplete();
    }
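
To tie this back to the original question, here is a rough sketch (assuming the data flux, processor::process and getPointID from the question) of how the partitioned stream could be consumed: items with the same ID always hash to the same partition and are processed sequentially by concatMap, while at most numberOfPartitions partitions run in parallel.

    // sketch only: derive the partition from the ID hash instead of opening one group per ID
    int numberOfPartitions = 3;

    data
            .groupBy(d -> Math.floorMod(d.getPointID().hashCode(), numberOfPartitions))
            .flatMap(partitionFlux -> partitionFlux
                            .concatMap(processor::process), // sequential within a partition, so per ID
                    numberOfPartitions)                     // at most this many partitions in parallel
            .subscribe();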

Regarding the backpressure question, I think it would be better addressed in a separate SO question.

  • This is actually pretty clever and I am mad at myself for not thinking of this. I did not realize I don't actually need N groups, one per ID, just M partitions matching my preferred maximum concurrency; basically the Kafka approach (which I already use, that's why I am mad). Thanks. – B.Gen.Jack.O.Neill Mar 25 '21 at 12:02
  • glad that I could help. StackOverflow also gave me a lot of answers in the past that helped my own projects :) – Felipe Mar 25 '21 at 12:05