1

I am using stream API reduce to test a array list of String.

for (int i = 0; i < 100; i++)
    {
        Stream<String> s1 = Stream.of("aa", "ab", "c", "ad");
        Predicate<String> predicate = t -> t.contains("a");

        List<String> strings2 = new ArrayList<>();
        s1.parallel().reduce(new ArrayList<String>(),
                new BiFunction<ArrayList<String>, String, ArrayList<String>>()
                {
                    @Override
                    public ArrayList<String> apply(ArrayList<String> strings, String s)
                    {
                        if (predicate.test(s))
                        {
                            strings.add(s);
                        }

                        return strings;
                    }
                }, new BinaryOperator<ArrayList<String>>()
                {
                    @Override
                    public ArrayList<String> apply(ArrayList<String> strings,
                            ArrayList<String> strings2)
                    {
                        return strings;
                    }
                }).stream().forEach( //
                        e -> {
                            strings2.add(e);
                        });

        if (strings2.contains(null))
        {
            System.out.println(strings2);
        }
    }
}

I went through couple of the blogs which says reduce could be used in this case, and the synchronization could be guaranteed, but the case above looks like this is not true. This test is TRUE in couple test runs,

strings2.contains(null)

so my question is that: Is the way in which I use reduce is incorrect, or there is something extra should be done to make sure the sych?

Holger
  • 285,553
  • 42
  • 434
  • 765
Robin Sun
  • 45
  • 7
  • `s1.parallel()` creates parallel streams and as stated in [Parallelism tutorial](https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html) `Aggregate operations and parallel streams enable you to implement parallelism with non-thread-safe collections provided that you do not modify the collection while you are operating on it` the accumulator and the combiner functions should not modify the `strings` list. Using serial streams (`s1.reduce(...`) could be a solution to guarantee "synchronization" –  Apr 22 '19 at 15:55
  • Whatever blogs you’ve read, stop reading them. `List strings2 = Stream.of("aa", "ab", "c", "ad") .parallel() .filter(t -> t.contains("a")) .collect(Collectors.toList());` is the right way to do that. What you’re doing, is broken *and* horribly complicated. – Holger Apr 23 '19 at 15:49
  • @Holger At first and foremost, thank you for the comments you have done here. My point for this question is not 'it's better to use `filter` here or `reduce` here, but I do want to learn how to use `reduce` in an appropriate way, in other words, in a thread-safe way. I thougt `reduce` function is thread-safe, however what it shows here has proven that I was wrong, so I am wondering if there is thread-safe way to `reduce` here. – Robin Sun Apr 25 '19 at 11:45
  • `reduce` is thread safe, if used correctly. In fact, every thread safe construct is only safe when being used correctly. But when a blog says “reduce could be used in *this* case”, it’s horribly wrong. – Holger Apr 25 '19 at 12:49

1 Answers1

1

It looks like filter is better suited to tackle this problem. If you want to use reduce, though, and especially when using it in parallel, you must not modify the accumulator objects (the lists in your case).

From the Oracle tutorial on reduction:

the accumulator function also returns a new value every time it processes an element

When I ran your code, I got two prints of list containing null, followed by an ArrayIndexOutOfBoundsException. The likely reason for this is that two threads tried to add elements to the same list at the same time. The exception occurred after the list was made longer, but before the element was added, hence the null (i.e. empty) slot.

ArrayList<String> strings2 = 
    s1.parallel()
      .reduce(new ArrayList<String>(), 
              (list, el) -> {
                if (el.contains("a")) {
                    ArrayList<String> added = new ArrayList<>(list);
                    added.add(el);
                    return added;
                }
                return list;
              }, 
              (list1, list2) -> {
                ArrayList<String> merged = new ArrayList<>(list1);
                merged.addAll(list2);
                    return merged;
              });

Instead of adding to the list, you have to make a copy of it, add to that copy and return the copy. This way, each thread can work on different parts of the input without interfering with others.

In addition, you cannot just throw out part of the result in the combiner, otherwise you will end up with incomplete results. You have to merge the lists instead of simply returning one of them.

Malte Hartwig
  • 4,477
  • 2
  • 14
  • 30
  • The linked tutorial also mentions the performance penalty of creating many new accumulator lists. For such cases they recommend using `Stream.collect` instead which performs a "mutable" reduction, i.e. your are in fact allowed to modify the accumulator objects. – Malte Hartwig Apr 23 '19 at 04:50
  • In the question’s code, the combiner was not throwing away results, as it received the same list in both arguments. However, that doesn’t make it a correct solution… – Holger Apr 23 '19 at 15:52
  • The point you mentioned that 'two threads tried to add elements to the same list at the same time.', I totally agree with this, and the simple why in which I can fix this is to use `CopyOnWriteArrayList` instead of `ArrayList`. Well...the main point I want get is that I want to learn more about how to use `reduce` correctly. – Robin Sun Apr 25 '19 at 11:36
  • And for the `combiner`, I believe it's a correct for a list that not using `addAll` for the second argument, which is the same list as the first argument. – Robin Sun Apr 25 '19 at 11:41
  • @RobinSun to understand `reduce`, you have to grasp the concept of *values*. A value is something that can’t be modified, there can only be different values of a type. You can evaluate a value (or multiple values) to a new value. When you think of immutable lists, you can concatenate lists to a new list and an empty list is the identity value for concatenation, but when you modify the `List` specified as first argument to `reduce`, it is not an identity value anymore and the entire operation is wrong, semantically, regardless of thread safety. – Holger Apr 25 '19 at 12:55
  • In practice, using a `CopyOnWriteArrayList` may eliminate structural inconsistencies, like encountering `null` elements or missing updates, etc., but it’s still an entirely wrong usage and may lead to getting elements in the wrong order. Even if it a particular wrong use of an API happens to produce the desired result, a wrong API usage is not guaranteed to work in different environments or future versions. – Holger Apr 25 '19 at 12:58
  • After read turoial again, went throught Malte's answer and the comment you added carefully, I realized that the _values_ you mentioned here, which is just like the input/original list in Malte's example, should be kept unmodified. Thank you, both of you. @Holger – Robin Sun Apr 28 '19 at 06:57