I am new to Apache Flink. I am trying to filter words that start with the letter "N", and that part works. How can I also get the words that don't start with "N"? Below is the code I am using:

package DataStream;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordStream {

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> inputData = env.socketTextStream("localhost", 9999);

        DataStream<String> filterData = inputData.filter(new FilterFunction<String>() {

            /**
             * 
             */
            private static final long serialVersionUID = 1L;

            @Override
            public boolean filter(String value) throws Exception {
                return value.startsWith("N");
            }
        });

        DataStream<Tuple2<String, Integer>> tokenize = filterData
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {

                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        out.collect(new Tuple2<String, Integer>(value, Integer.valueOf(1)));

                    }
                });

        DataStream<Tuple2<String, Integer>> counts = tokenize.keyBy(0).sum(1);

        counts.print();

        env.execute("WordStream");

    }

}

Can you suggest how to capture the words that don't match in another stream?

YRK

2 Answers

I think you can use a side output to achieve this. In a ProcessFunction, emit the matched elements to the regular collector and the unmatched elements to a side output identified by an OutputTag, then fetch the side-output elements from the operator that produces the main stream.

For example, your code can be changed to something like this:

package datastream;


import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class WordStream {

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> inputData = env.socketTextStream("localhost", 9999);

        // Initialize side-output tag to collect the un-matched elements 
        OutputTag<Tuple2<String, Integer>> unMatchedSideOutput = new OutputTag<Tuple2<String, Integer>>("unmatched-side-output") {};

        SingleOutputStreamOperator<Tuple2<String, Integer>> tokenize = inputData
                .process(new ProcessFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<Tuple2<String, Integer>> out) {
                        if (value.startsWith("N")) {
                            // Emit the data to actual collector
                            out.collect(new Tuple2<>("Matched=" + value, Integer.valueOf(1)));
                        } else {
                            // Emit the un-matched data to side output
                            ctx.output(unMatchedSideOutput, new Tuple2<>("UnMatched=" + value, Integer.valueOf(1)));
                        }
                    }
                });

        DataStream<Tuple2<String, Integer>> count = tokenize.keyBy(0).sum(1);

        // Fetch the un-matched element using side-output tag and process it
        DataStream<Tuple2<String, Integer>> unMatchedCount = tokenize.getSideOutput(unMatchedSideOutput).keyBy(0).sum(1);

        count.print();

        unMatchedCount.print();

        env.execute("WordStream");

    }
}

Note: I prefixed the emitted values with Matched= and UnMatched= to make the output easier to read.

For the input below,

Hello
Nevermind
Hello

I get the following output:

3> (UnMatched=Hello,1)
4> (Matched=Nevermind,1)
3> (UnMatched=Hello,2)
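
The counts above are running totals: a keyed sum emits an updated tuple for every incoming element, which is why Hello shows up first with 1 and then with 2. The following is a minimal plain-Java sketch of that behavior, not Flink code. The class and method names are hypothetical, a HashMap stands in for Flink's keyed state, and the subtask prefix (3>, 4>) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the pipeline above; it does not use Flink.
public class RunningCountSketch {

    // Tags each line as matched or unmatched, then emits the running
    // total for its key, one result per input element.
    static List<String> runningCounts(List<String> lines) {
        Map<String, Integer> totals = new HashMap<>(); // stand-in for keyed state
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            String key = (line.startsWith("N") ? "Matched=" : "UnMatched=") + line;
            int total = totals.merge(key, 1, Integer::sum); // add 1 to this key's total
            emitted.add("(" + key + "," + total + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        runningCounts(Arrays.asList("Hello", "Nevermind", "Hello"))
                .forEach(System.out::println);
    }
}
```

In the real job the matched and unmatched streams are printed by separate sinks, so the interleaving of the two may differ from run to run.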
Jaya Ananthram
  • Is it possible to do the same thing with a FilterFunction, without using a ProcessFunction? – YRK May 12 '20 at 14:33
  • No, you can't, because the [filter method interface](https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/functions/FilterFunction.java#L59) doesn't give you a context object to emit elements yourself. That's why we have to use [ProcessFunction](https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/ProcessFunction.java#L70), which provides rich objects like Context and Collector. – Jaya Ananthram May 12 '20 at 14:45

A simpler solution:

DataStream<String> nwords = input.filter(s -> s.startsWith("N"));
DataStream<String> others = input.filter(s -> !s.startsWith("N"));

I believe this is slightly less efficient than the solution using a side output, but it will still run in a single task thanks to operator chaining, so it requires no serialization/deserialization overhead and no networking.

Don't get me wrong -- in general side outputs are the way to go for splitting streams.
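
With the two-filter approach, the second predicate must be the exact negation of the first, otherwise elements get dropped or duplicated. One way to guard against that is to keep the test in a single helper and check it without a cluster. The sketch below is plain Java, with java.util.stream standing in for DataStream. The class and method names are hypothetical and are not part of Flink.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the two filters above; it does not use Flink.
public class SplitPredicateSketch {

    // The single source of truth for the split condition.
    static boolean startsWithN(String word) {
        return word.startsWith("N");
    }

    // Counterpart of input.filter(s -> s.startsWith("N")).
    static List<String> matched(List<String> words) {
        return words.stream()
                .filter(SplitPredicateSketch::startsWithN)
                .collect(Collectors.toList());
    }

    // Counterpart of input.filter(s -> !s.startsWith("N")).
    static List<String> unmatched(List<String> words) {
        return words.stream()
                .filter(w -> !startsWithN(w))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello", "Nevermind", "Hello");
        System.out.println("matched:   " + matched(input));
        System.out.println("unmatched: " + unmatched(input));
    }
}
```

Because both lists are derived from the same helper, every input element lands in exactly one of them, which is the property the two Flink filters need as well.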

David Anderson