13

I have a set of domain objects that inherit from a shared type (e.g. GroupRecord extends Record, RequestRecord extends Record). The subtypes have specific properties (e.g. GroupRecord::getCumulativeTime, RequestRecord::getResponseTime).

Further, I have a list of records with mixed subtypes as a result of parsing a logfile.

List<Record> records = parseLog(...);

In order to compute statistics on the log records, I want to apply math functions only to the subset of records that matches a specific subtype, e.g. only to GroupRecords. Therefore I want a filtered stream of a specific subtype. I know that I can apply a filter and map to a subtype using

records.stream()
       .filter(GroupRecord.class::isInstance)
       .map(GroupRecord.class::cast)
       .collect(...

Applying this filter-and-cast to the stream multiple times (especially when doing it for the same subtype in several different computations) is not only cumbersome but produces a lot of duplication.

My current approach is to use a TypeFilter

class TypeFilter<T>{

    private final Class<T> type;

    public TypeFilter(final Class<T> type) {
        this.type = type;
    }

    public Stream<T> filter(Stream<?> inStream) {
        return inStream.filter(type::isInstance).map(type::cast);
    }
}

To be applied to a stream:

TypeFilter<GroupRecord> groupFilter = new TypeFilter<>(GroupRecord.class);

SomeStatsResult stats1 = groupFilter.filter(records.stream())
                                      .collect(...)
SomeStatsResult stats2 = groupFilter.filter(records.stream())
                                      .collect(...)

It works, but I find this approach a bit much for such a simple task. So I wonder: what is the best way to make this behavior reusable, using streams and functions, in a concise and readable way?

John Kugelman
Gerald Mücke
    can you group those entries may be? `records.stream().collect(Collectors.groupingBy(Record::getClass));` and then simply do a get for the class you are interested in? – Eugene May 29 '17 at 12:17
  • 1
    @Eugene, if the goal is to not iterate records several times, then collecting filtered objects into intermediate list and then processing two times does way more good than collecting everything into a map and only using one of its entries. Also, grouping by class doesn't give you type safety, you'd still have to cast manually. – M. Prokhorov May 29 '17 at 12:24
  • 2
    So you are filtering and then casting many types over the whole dataset to calculate different statistics. I don't consider this to be the best approach. Instead, you should be able to calculate all statistics in one single pass. Besides, if you also need to calculate several (other) statistics for the other subtypes of `Record`, you should be able to still do everything in a single pass. Only difference is that each set of statistics would need to be calculated over a different subset of the whole dataset. No matter what your specific use case is, I think you'd need a custom collector. – fps May 29 '17 at 15:09

3 Answers

12

It depends on what you find "more concise and readable". I myself would argue that the way you already implemented it is fine as it is.

However, there is indeed a way to do this that is slightly shorter at the point of use, by using Stream.flatMap:

static <E, T> Function<E, Stream<T>> onlyTypes(Class<T> cls) {
  return el -> cls.isInstance(el) ? Stream.of(cls.cast(el)) : Stream.empty();
}

This converts each original stream element either to a Stream of one element, if the element has the expected type, or to an empty Stream if it does not.

And the use is:

records.stream()
  .flatMap(onlyTypes(GroupRecord.class))
  .forEach(...);

There are obvious tradeoffs in this approach:

  • You do lose the "filter" word from your pipeline definition. That may be more confusing than the original, so perhaps a better name than onlyTypes is needed.
  • Stream objects are relatively heavyweight, and creating so many of them may result in performance degradation. But don't take my word for it: profile both variants under realistic load.
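As a side note on that second tradeoff: on Java 16 or newer, `Stream.mapMulti` achieves the same filter-and-cast without allocating a nested Stream per element. A minimal sketch, using stand-in `LogRecord`/`GroupRecord` types since the question's actual hierarchy isn't shown (the names are illustrative):

```java
import java.util.List;

public class MapMultiDemo {
    // Minimal stand-ins for the question's record hierarchy (illustrative only).
    interface LogRecord {}
    record GroupRecord(long cumulativeTime) implements LogRecord {}
    record RequestRecord(long responseTime) implements LogRecord {}

    static List<GroupRecord> groupRecords(List<LogRecord> records) {
        // mapMulti pushes zero or one element downstream per input element,
        // avoiding the per-element Stream allocation of the flatMap variant.
        return records.stream()
                .<GroupRecord>mapMulti((rec, downstream) -> {
                    if (rec instanceof GroupRecord g) {
                        downstream.accept(g);
                    }
                })
                .toList();
    }

    public static void main(String[] args) {
        List<LogRecord> records = List.of(
                new GroupRecord(10), new RequestRecord(5), new GroupRecord(20));
        System.out.println(groupRecords(records).size()); // 2
    }
}
```

Of course, this postdates the question by several Java releases; the flatMap variant works everywhere from Java 8 on.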

Edit:

Since the question asks about reusing filter and map in slightly more general terms, this answer can also discuss a little more abstraction. To reuse filter and map in general terms, you need the following:

static <E, R> Function<E, Stream<R>> filterAndMap(Predicate<? super E> filter, Function<? super E, R> mapper) {
   return e -> filter.test(e) ? Stream.of(mapper.apply(e)) : Stream.empty();
}

And original onlyTypes implementation now becomes:

static <E, T> Function<E, Stream<T>> onlyTypes(Class<T> cls) {
  return filterAndMap(cls::isInstance, cls::cast);
}

But then there is yet again a tradeoff: the resulting flat-mapper function now captures two objects (the predicate and the mapper) instead of the single Class object in the implementation above. It may also be a case of over-abstraction, but that depends on where and why you would need the code.
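To see that extra generality in action, the same `filterAndMap` combinator works for an ordinary predicate/mapper pair, not just type checks. A small self-contained sketch (the even-number example is purely illustrative):

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FilterAndMapDemo {
    // The combinator from the answer: filter and map fused into one flatMap function.
    static <E, R> Function<E, Stream<R>> filterAndMap(
            Predicate<? super E> filter, Function<? super E, R> mapper) {
        return e -> filter.test(e) ? Stream.of(mapper.apply(e)) : Stream.empty();
    }

    public static void main(String[] args) {
        // Reused here for a plain predicate/mapper pair: keep even numbers, square them.
        List<Integer> squares = Stream.of(1, 2, 3, 4, 5)
                .flatMap(filterAndMap(n -> n % 2 == 0, n -> n * n))
                .collect(Collectors.toList());
        System.out.println(squares); // [4, 16]
    }
}
```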

M. Prokhorov
9

You don’t need an entire class to encapsulate a piece of code. The smallest code unit for that purpose is a method:

public static <T> Stream<T> filter(Collection<?> source, Class<T> type) {
    return source.stream().filter(type::isInstance).map(type::cast);
}

This method can be used as

SomeStatsResult stats1 = filter(records, GroupRecord.class)
                            .collect(...);
SomeStatsResult stats2 = filter(records, GroupRecord.class)
                            .collect(...);

If the filtering operation isn’t always the first step in your chain, you may overload the method:

public static <T> Stream<T> filter(Collection<?> source, Class<T> type) {
    return filter(source.stream(), type);
}
public static <T> Stream<T> filter(Stream<?> stream, Class<T> type) {
    return stream.filter(type::isInstance).map(type::cast);
}

However, if you have to repeat this operation multiple times for the same type, it might be beneficial to do

List<GroupRecord> groupRecords = filter(records, GroupRecord.class)
                            .collect(Collectors.toList());
SomeStatsResult stats1 = groupRecords.stream().collect(...);
SomeStatsResult stats2 = groupRecords.stream().collect(...);

not only eliminating the duplication in the source code, but also performing the runtime type checks only once. Whether the additional heap space matters depends on the actual use case.
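As an aside: when both statistics reduce to standard collectors and Java 12+ is available, `Collectors.teeing` can even compute the two results in a single pass over the filtered elements. A minimal sketch with a hypothetical `Stats` pair (the record syntax needs Java 16+; `Stats` is not from the question):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TeeingDemo {
    // Hypothetical result pair standing in for the question's SomeStatsResult.
    record Stats(long count, long sum) {}

    public static void main(String[] args) {
        List<Long> times = List.of(10L, 20L, 30L);
        // Collectors.teeing (Java 12+) feeds every element to two downstream
        // collectors and merges their results, so both stats come from one pass.
        Stats stats = times.stream().collect(Collectors.teeing(
                Collectors.counting(),
                Collectors.summingLong(Long::longValue),
                Stats::new));
        System.out.println(stats); // Stats[count=3, sum=60]
    }
}
```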

Holger
3

What you actually need is a Collector that collects all elements of the stream that are instances of a particular type. It solves your problem easily and avoids filtering the stream twice:

List<GroupRecord> result = records.stream().collect(
      instanceOf(GroupRecord.class, Collectors.toList())
); 

SomeStatsResult stats1 = result.stream().collect(...);
SomeStatsResult stats2 = result.stream().collect(...);

And you can go further and do something like Stream#map by using Collectors#mapping, for example:

List<Integer> result = Stream.of(1, 2L, 3, 4.)
   .collect(instanceOf(Integer.class, mapping(it -> it * 2, Collectors.toList())));
// keeps only the Integers [1, 3], which mapping then turns into [2, 6]

Where you only want to consume the Stream once, you can simply compose the collector, as below:

SomeStatsResult stats = records.stream().collect(
      instanceOf(GroupRecord.class, ...)
); 

static <T, U extends T, A, R> Collector<T, ?, R> instanceOf(Class<U> type
        , Collector<U, A, R> downstream) {
    return new Collector<T, A, R>() {
        @Override
        public Supplier<A> supplier() {
            return downstream.supplier();
        }

        @Override
        public BiConsumer<A, T> accumulator() {
            BiConsumer<A, U> target = downstream.accumulator();
            return (result, it) -> {
                if (type.isInstance(it)) {
                    target.accept(result, type.cast(it));
                }
            };
        }

        @Override
        public BinaryOperator<A> combiner() {
            return downstream.combiner();
        }

        @Override
        public Function<A, R> finisher() {
            return downstream.finisher();
        }

        @Override
        public Set<Characteristics> characteristics() {
            return downstream.characteristics();
        }
    };
}

Why compose Collectors?

Do you remember the composition-over-inheritance principle? Do you remember assertThat(foo).isEqualTo(bar) versus assertThat(foo, is(bar)) in unit tests?

Composition is much more flexible: it lets you reuse small pieces of code and compose components at runtime. That is why I prefer hamcrest over fest-assert, since it can compose all possible Matchers together, and why functional programming is so popular: it reuses small pieces of code at the function level rather than the class level. You can also see that the JDK introduced Collectors#filtering in jdk-9, which makes the execution routes shorter without losing expressiveness.

And you can refactor the code above further, according to separation of concerns, so that filtering can be reused just like jdk-9's Collectors#filtering:

static <T, U extends T, A, R> Collector<T, ?, R> instanceOf(Class<U> type
        , Collector<U, A, R> downstream) {
  return filtering(type::isInstance, Collectors.mapping(type::cast, downstream));
}

static <T, A, R>
Collector<T, ?, R> filtering(Predicate<? super T> predicate
        , Collector<T, A, R> downstream) {
    return new Collector<T, A, R>() {
        @Override
        public Supplier<A> supplier() {
            return downstream.supplier();
        }

        @Override
        public BiConsumer<A, T> accumulator() {
            BiConsumer<A, T> target = downstream.accumulator();
            return (result, it) -> {
                if (predicate.test(it)) {
                    target.accept(result, it);
                }
            };
        }

        @Override
        public BinaryOperator<A> combiner() {
            return downstream.combiner();
        }

        @Override
        public Function<A, R> finisher() {
            return downstream.finisher();
        }

        @Override
        public Set<Characteristics> characteristics() {
            return downstream.characteristics();
        }
    };
}
holi-java
  • Hi holi-java! Why are you filtering in the collector? For this specific use case, I would use `filter` and `map` and collect to a list. Using a collector would be the best approach if you need to calculate all the statistics at once (in a single pass), otherwise I see no point in using this. – fps May 29 '17 at 15:16
  • @FedericoPeraltaSchaffner sir, see my edited answer please. – holi-java May 29 '17 at 15:18
  • 1
    OK, I see your point... You could now use your collector as a downstream collector or use another downstream collector after yours. Still, I think that having the type check and the cast inside the collector is overkill, unless you do something else, such as collecting all statistics together. – fps May 29 '17 at 15:23
  • @FedericoPeraltaSchaffner no, filtering elements in collectors sometimes, will make the execution routes shorter. are you remember unit test asserts? `assertThat(foo).is(bar)` v.s `assertThat(foo,is(bar))`. the seconds is more flexible that you can combine all possible `Matchers` into your `assertThat`. so that is why I prefer `hamcrest` rather than `fest-assert` – holi-java May 29 '17 at 15:23
  • 2
    Yes, I'm aware of `Collectors.filtering` in jdk9. It only makes sense to use it as a downstream collector. The docs you linked also say that `The filtering() collectors are most useful when used in a multi-level reduction, such as downstream of a groupingBy or partitioningBy`. – fps May 29 '17 at 15:40
  • @FedericoPeraltaSchaffner thanks for your feedback. My answer according to SoC now. it is beautiful and low coupling. – holi-java May 29 '17 at 17:33
  • I understand using collectors as a way to collect `Stats` of some description, and i will [always](https://stackoverflow.com/a/43050269/7470253) [argue](https://stackoverflow.com/a/42557741/7470253) that collecting directly into `Stats` object is usually superior to first collecting to list and then iterating the list to get `Stats`. What I don't understand is why do we need a filtering collector for this typecheck problem when we only need to collect single type (means `filter()` will suffice). Collector is by design a more complex entity, so I see this solution as an overkill for this case. – M. Prokhorov Jun 01 '17 at 11:28
  • @M.Prokhorov well, after answer this question, I think it too more. maybe my another [question](https://stackoverflow.com/questions/44258757/why-stream-operations-is-duplicated-with-collectors) will clarify your confusion. and I'm sure `Collectors` is more effectively than `Stream` operations in this case. – holi-java Jun 01 '17 at 13:59
  • Do, I don't think of collector operations as more efficient. The fact that there are some similarities between Streaming and Collecting operation doesn't mean that they are interchangeable. By applying `map()` to stream, you tranform each element of the stream to something else, and you then may delegate further operations elsewhere. By using `mapping()` collectors, you first materialize the whole transformed dataset into memory, and only then pass result elsewhere. Now imagine what would happen if stream contains 1TB of data overall, and "elsewhere" only needs a single first element. – M. Prokhorov Jun 01 '17 at 14:43
  • @M.Prokhorov your case I have described in my question that is `Collector`s not support short-circuiting, sir. On the other hand, `Collector`s is another way to collecting elements when operates on a `Stream` is tricky. and I never say `Collector`s is omnipotent. and you can see clearly the OP really need a `Collector` in this case. – holi-java Jun 01 '17 at 14:48
  • @holi-java, he indeed needs a collector, but he doesn't need `filtering()` Collector specifically, nor does he need a `mapping()` collector - he maybe needs a collector that would make collecting `Stats` easier, that's about it. There are actually not that many places where you don't need collectors when interfacing with legacy code: because most legacy works with data structures, and `Stream` is not a data structure, it is a data stream. – M. Prokhorov Jun 01 '17 at 14:58
  • @M.Prokhorov sir, you must realized both `map()` & `mapping()` actually occurs in terminal operations. so It is trivial to comparing between them. – holi-java Jun 01 '17 at 14:59
  • @M.Prokhorov in all, I prefer `filter()` if only operate on some simpler operation on a `Stream`. otherwise I'll using a `filtering()`. – holi-java Jun 01 '17 at 15:03
  • `map()` is not a terminal operation, and cannot occur during terminal operation, unless it is done on another Stream. It is a lazily executed intermediate operation. And I don't see `mapping` collector anywhere in this question text. – M. Prokhorov Jun 01 '17 at 15:04
  • @M.Prokhorov what I said is `map()` behavior only occurs in termnal operation. all of intermediate operations is lazily. We need to stop talking it... if you really like `Stream` just do it. I'm sorry I'm not good at English. but I struggle for answer your confusion, so I have no regrets. – holi-java Jun 01 '17 at 15:05
  • @holi-java, this justification is not correct. You should prefer direct filter operations if you don't need to build complex data structure from a `Stream`. If you do need a Data structure, then you should use collector composition. I've yet to find a valid use for `filtering` collector though - it can always be handled by a filter on Stream pipeline. – M. Prokhorov Jun 01 '17 at 15:06