1

I want to filter out the duplicates after the first CompletableFuture and then invoke the second stage with another CompletableFuture. What I tried:

@FunctionalInterface
public interface FunctionWithExceptions<T, R, E extends Exception> {
    R process(T t) throws E;
}


public static <T> Predicate<T> distinctByKey(FunctionWithExceptions<? super T, ?, ?> keyExtractor) {
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    return t -> {
        String key = "";
        try {
            key = (String) keyExtractor.process(t);
        } catch (Exception e) {
            log.info("Get instanceIp failed!");
        }
        return seen.add(key);
    };
}

List<CompletableFuture<InstanceDo>> instanceFutures = podNames.stream()
            .map(podName -> CompletableFuture.supplyAsync(RethrowExceptionUtil.rethrowSupplier(() -> {
                PodDo podDo = getPodRetriever().getPod(envId, podName);
                podDoList.add(podDo);
                return podDo;
            }), executor))
            .map(future -> future.thenApply(podDo -> podDo.getInstanceName()))
            .filter(distinctByKey(CompletableFuture::get))
            .map(future -> future.thenCompose(instanceName ->
                    CompletableFuture.supplyAsync(() -> get(envId, instanceName), executor)))
            .collect(Collectors.toList());

As you can see, `distinctByKey` invokes `get()`, which blocks and turns the concurrent pipeline into a sequential one.

What should I do to make it **concurrent** again while still retaining the distinct filtering?

Or do I have only one choice: to wait for the entire first stage to complete and only then start the second stage?
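For what it's worth, that second option can be sketched with `CompletableFuture.allOf()` as a barrier between the stages. This is a minimal, self-contained sketch; `stage1`/`stage2` are hypothetical stand-ins for `getPod(...)` and `get(envId, instanceName)`:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class TwoPhaseDemo {
    // Hypothetical stand-ins for the real stage-1 and stage-2 work.
    static int stage1(int i) { return i % 3; }       // may produce duplicates
    static int stage2(int key) { return key * 10; }  // expensive second stage

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Phase 1: run all first-stage tasks concurrently.
        List<CompletableFuture<Integer>> firstStage = IntStream.range(0, 5)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> stage1(i), executor))
                .collect(Collectors.toList());

        // Barrier: wait until every first-stage future is done.
        CompletableFuture.allOf(firstStage.toArray(new CompletableFuture[0])).join();

        // Dedupe the completed results, then launch stage 2 concurrently.
        List<CompletableFuture<Integer>> secondStage = firstStage.stream()
                .map(CompletableFuture::join)   // safe: all futures already completed
                .distinct()
                .map(key -> CompletableFuture.supplyAsync(() -> stage2(key), executor))
                .collect(Collectors.toList());

        secondStage.forEach(f -> System.out.println(f.join())); // prints 0, 10, 20
        executor.shutdown();
    }
}
```

The cost of this approach is exactly the drawback you mention: the slowest stage-1 task delays every stage-2 task, even those whose inputs were ready much earlier.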

Hearen
    Be careful, `thenApply` and `thenCompose` are synchronous. You would probably prefer `thenApplyAsync` and `thenComposeAsync`. – kagmole May 22 '18 at 14:05
  • @kagmole Thanks for the reply. But I do **not** think you get the point here. I know pretty well about the **\*Async** since they're well documented. I want the **second** just run after the **first** and **thenCompose** here is quite enough since it will work in the same thread as the **first** stage. – Hearen May 22 '18 at 14:17
  • I see, and there is still indeed the `distinctByKey` problem. I may have a suggestion, based on `Optional`, but I fear it will end up unnecessarily bloated. – kagmole May 22 '18 at 14:38
  • @kagmole Thanks for the **idea**, I will have a try. B.T.W it's the **distinct** problem from the very beginning actually. Hahah, thank you, man. – Hearen May 22 '18 at 15:12
  • Yes, sorry I was not very clear. :) – kagmole May 22 '18 at 15:31
  • @kagmole Not a problem, man. What do you think of my **solution** below. Have some ideas to make it better? – Hearen May 22 '18 at 15:38

2 Answers

0

I just wrote a simple demo to solve this kind of issue, though I really don't know whether it's reliable or not. At least it ensures the second stage stays concurrent, tracking duplicates with `Set<Object> seen = ConcurrentHashMap.newKeySet();`.

public static void main(String... args) {
    // "out" is a static import of System.out
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    List<CompletableFuture<Integer>> intFutures = Stream.iterate(0, i -> i + 1)
            .limit(5)
            .map(i -> CompletableFuture.supplyAsync(() -> {
                int a = runStage1(i);
                // -1 is a sentinel marking a duplicate stage-1 result
                return seen.add(a) ? a : -1;
            }))
            .map(future -> future.thenCompose(i -> CompletableFuture.supplyAsync(() ->
                    // stage-1 results are non-negative, so only duplicates are skipped
                    i >= 0 ? runStage2(i) : i)))
            .collect(Collectors.toList());

    List<Integer> resultList = new ArrayList<>();
    try {
        for (CompletableFuture<Integer> future : intFutures) {
            resultList.add(future.join());
        }
    } catch (Exception e) {
        e.printStackTrace();
        out.println("Future failed!");
    }
    resultList.forEach(out::println);
}

private static Integer runStage1(int a) {
    out.println("stage - 1: " + a);
    try {
        Thread.sleep(500 + Math.abs(new Random().nextInt()) % 1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return a % 3;
}

private static Integer runStage2(int b) {
    out.println("stage - 2: " + b);
    try {
        Thread.sleep(200 + Math.abs(new Random().nextInt()) % 1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return b;
}

By returning a sentinel value (-1) from the first stage when the result is a duplicate, the second stage can detect it and skip the time-consuming computation.

The output indeed shows that some redundant second-stage computation is filtered out:

stage - 1: 0
stage - 1: 1
stage - 1: 2
stage - 1: 3
stage - 2: 2
stage - 2: 1
stage - 2: 0
stage - 1: 4
0
1
2
-1
-1

Still, I don't think it's a good solution. What can I optimise to make it better?

Hearen
  • Why do you think it is not a good solution? The only issue I see is that the resulting list contains a mix of the results of the first stage (the `-1`) and the second stage, but you can easily filter them out. – Didier L May 23 '18 at 09:26
  • You should use `ThreadLocalRandom.current()` instead of `new Random()` as it is thread-safe and the preferred alternative. Your solution is similar to what I was thinking, you just used a discriminator value `-1` instead of `Optional`, which is nice if you can do it (no `Optional` overhead). I don't see an improvement right now though. – kagmole May 23 '18 at 09:56
  • @kagmole Thanks for the help, I got the point and improved it in this way. Besides, thanks for the soft reminder using **thread-safe** random, which I forgot. Thank you ~ – Hearen May 23 '18 at 12:25
  • @DidierL Thanks for the point, I just tested them and it seems good. What I thought is that it could be more elegant - filtering out directly instead of checking it in the second stage and then filter. – Hearen May 23 '18 at 12:27
0

A small improvement over your own answer might be to use the `ConcurrentHashMap` as a kind of cache, so that your final list contains the same results regardless of the order in which the first stage completes:

Map<Integer, CompletableFuture<Integer>> seen = new ConcurrentHashMap<>();
List<CompletableFuture<Integer>> intFutures = Stream.iterate(0, i -> i + 1)
        .limit(5)
        .map(i -> CompletableFuture.supplyAsync(() -> runStage1(i)))
        .map(cf -> cf.thenCompose(result ->
                seen.computeIfAbsent(
                        result, res -> CompletableFuture.supplyAsync(() -> runStage2(res))
                )
        ))
        .collect(Collectors.toList());

Note that it is important that the function passed to `computeIfAbsent()` returns immediately (hence the inner `supplyAsync()`): the map locks the affected bin while the mapping function runs, so a slow function would block other threads computing the same or colliding keys. In addition, this function must not attempt to modify the `seen` map itself, as that can lead to deadlock or an `IllegalStateException`.
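To illustrate that locking behaviour with a contrived sketch (not taken from the answer above): if the mapping function itself is slow, a second thread asking for the same key blocks inside `computeIfAbsent()` until the first thread's mapping function finishes:

```java
import java.util.*;
import java.util.concurrent.*;

public class ComputeIfAbsentHazard {
    public static void main(String[] args) throws Exception {
        Map<Integer, Integer> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // A deliberately slow mapping function, simulating stage 2
        // being run directly inside computeIfAbsent().
        Runnable slowCompute = () -> cache.computeIfAbsent(42, k -> {
            try { Thread.sleep(300); } catch (InterruptedException e) { }
            return k * 2;
        });

        long start = System.nanoTime();
        // Both tasks target key 42: the second blocks until the first's
        // mapping function completes, so the mapping runs exactly once
        // and the waiter is stalled for the full ~300 ms.
        Future<?> f1 = pool.submit(slowCompute);
        Future<?> f2 = pool.submit(slowCompute);
        f1.get();
        f2.get();

        System.out.println(cache.get(42)); // 84
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms total");
        pool.shutdown();
    }
}
```

Wrapping the work in `supplyAsync()` instead means the mapping function only constructs a future, so the lock is held for a negligible time.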

With this change the output could be for example:

stage - 1: 1
stage - 1: 0
stage - 1: 2
stage - 2: 1
stage - 2: 2
stage - 1: 3
stage - 2: 0
stage - 1: 4
0
1
2
0
1

Additionally, this allows you to inspect the `seen` map after all futures have completed in order to collect the unique results.
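For example, collecting one stage-2 result per distinct stage-1 value could look like the sketch below (with a stand-in `runStage2` and hard-coded stage-1 results, since the real ones come from the futures above):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class SeenInspection {
    static int runStage2(int key) { return key * 10; } // stand-in for the real stage 2

    public static void main(String[] args) {
        Map<Integer, CompletableFuture<Integer>> seen = new ConcurrentHashMap<>();

        // Hard-coded stage-1 results, with duplicates.
        for (int result : new int[] {1, 0, 2, 0, 1}) {
            seen.computeIfAbsent(result,
                    res -> CompletableFuture.supplyAsync(() -> runStage2(res)));
        }

        // Wait for every cached stage-2 future to complete.
        CompletableFuture.allOf(seen.values().toArray(new CompletableFuture[0])).join();

        // One stage-2 result per distinct stage-1 value (sorted, since
        // the map's iteration order is unspecified).
        List<Integer> unique = seen.values().stream()
                .map(CompletableFuture::join)
                .sorted()
                .collect(Collectors.toList());
        System.out.println(unique); // [0, 10, 20]
    }
}
```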

Didier L