
Below is a Flink program (Java) that reads tweets from a file, extracts hash tags, counts the number of occurrences of each hash tag, and finally writes the results to a file.

In this program there is a sliding window of size 20 seconds that slides by 5 seconds. In the sink, all output data is written into a file named outfile. That means every 5 seconds one window fires and writes its data into outfile.

My Problem:

I want the data of every window firing (i.e. every 5 seconds) to be written to a new file, instead of being appended to the same file. Kindly guide me where and how this can be done. Do I need to use a custom trigger, some configuration on the sink, or anything else?

Code:

<!-- language: lang-java -->

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

env.getConfig().setAutoWatermarkInterval(100);

env.enableCheckpointing(5000,CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);

String path = "C:\\Users\\eventTime";
// Reading data from files of folder eventTime.
DataStream<String> streamSource = env.readFile(new TextInputFormat(new Path(path)), path, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000).uid("read-1");

//Extracting the hash tags of tweets
DataStream<Tuple3<String, Integer, Long>> mapStream = streamSource.map(new ExtractHashTagFunction());   

//generating watermarks and extracting the timestamps from tweets
DataStream<Tuple3<String, Integer, Long>> withTimestampsAndWatermarks = mapStream.assignTimestampsAndWatermarks(new MyTimestampsAndWatermarks());

KeyedStream<Tuple3<String, Integer, Long>,Tuple> keyedStream = withTimestampsAndWatermarks.keyBy(0);

//Using a sliding window of 20 seconds which slides by 5 seconds.
SingleOutputStreamOperator<Tuple4<String, Integer, Long, String>> aggregatedStream = keyedStream
        .window(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
        .aggregate(new AggregateHashTagCountFunction()).uid("agg-123");

aggregatedStream.writeAsText("C:\\Users\\outfile", WriteMode.NO_OVERWRITE).setParallelism(1).uid("write-1");

env.execute("twitter-analytics");
Gaurav

1 Answer


If you are not satisfied with the built-in sinks, you can define your own custom sink:

stream.addSink(new MyCustomSink ...)

MyCustomSink should implement the SinkFunction interface.

Your custom sink will contain a FileWriter and, e.g., a counter. Every time the sink is invoked, it writes to "/path/to/file" + counter + ".yourFileExtension".

https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/functions/sink/SinkFunction.html
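
A minimal sketch of such a sink, assuming the Tuple4 output type from the question; the class name MyCustomSink, the basePath parameter, and the .txt extension are illustrative choices, not part of the Flink API:

<!-- language: lang-java -->

import java.io.FileWriter;

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Writes every incoming record to its own numbered file.
public class MyCustomSink<T> extends RichSinkFunction<T> {

    private final String basePath; // e.g. "C:\\Users\\outfile-"
    private int counter = 0;       // per-subtask; not restored from checkpoints

    public MyCustomSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void invoke(T value) throws Exception {
        // invoke() runs once per record emitted by the upstream operator,
        // so with parallelism 1 each record ends up in a new file.
        try (FileWriter writer = new FileWriter(basePath + counter + ".txt")) {
            writer.write(value.toString());
        }
        counter++;
    }
}

Attached to the stream from the question this would look like aggregatedStream.addSink(new MyCustomSink<>("C:\\Users\\outfile-")).setParallelism(1); note that, as the comments below work out, this creates one file per record, not per window firing.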

Alex
    As an aside, it would be pretty easy to have this custom sink (e.g. `PerRecordSink`) "wrap" the real sink (S3, etc) so that you get those implementations for free. – kkrugler Mar 12 '18 at 17:47
  • @Alex, actually, when I write a custom sink and put "/path/to/file" + counter + ".yourFileExtension" in the invoke() function of the custom sink, a new file is generated for every record. But my requirement is to create a new file for every window firing; one window firing can produce multiple records, and I want all those records in one file. I hope you get that. Kindly suggest. – Gaurav Mar 13 '18 at 03:16
  • @Gaurav yes, the sink is invoked every time the window emits a record, so it's a matter of how many records your window emits. Therefore, you should remodel your function so that it emits e.g. a single array of records at the end, which is then written to a file at once. – Alex Mar 13 '18 at 08:20
  • @Alex, I have written a custom process() function on the window. This process function takes all elements from the Iterable, appends them to a local string, and finally emits that string through the process function's collector. In the custom sink, I create a new file in the invoke() function by using a counter. Result: again a new file is generated for every record instead of for a whole window's data. It means that when data reaches the sink, the sink does not know the boundary of each window's data. If you want, I can share the code. – Gaurav Mar 13 '18 at 11:08
  • @Gaurav if you can provide a minimal example that reproduces the issue, please do so. – Alex Mar 13 '18 at 11:28
  • @Alex, thanks Alex. I was trying many things and finally found the solution. I was using a keyed stream, so my original stream was being split into multiple logical keyed streams (one per key in the window), and the sink function's invoke() was being called for every logical keyed stream. So I used Flink's windowAll() function to collapse all the keyed streams into one stream, which is then passed to the sink (see the sketch after these comments). It worked, but yes, I had to use a custom sink function as you suggested. Thanks again. – Gaurav Mar 14 '18 at 08:22
  • @Gaurav glad that I was able to help. You might as well accept the answer and upvote. – Alex Mar 14 '18 at 09:25
  • @Alex, I have upvoted your answer, but it needed other things as well, like the windowAll() function I explained above, so it would be misleading if I accepted the answer at this stage. I hope you understand. But thanks a lot, Alex; maybe we can discuss Flink more in the future as well. – Gaurav Mar 14 '18 at 14:23
  • @Gaurav what you have to understand is that Stack Overflow has a short question-and-answer format. If I answered your question, please accept the answer. There is no such thing as long-term support; if you have another issue, you will have to create another short and specific question. PS: Flink has a user mailing list. More on this here: http://flink.apache.org/community.html – Alex Mar 14 '18 at 16:31
  • @Alex, okay, got it. And if you find the question useful, you can upvote it too. – Gaurav Mar 14 '18 at 17:02
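
For reference, here is a rough sketch of the windowAll() approach described in the last comments, reusing the aggregatedStream from the question and the hypothetical MyCustomSink from above; it is an illustration under those assumptions, not Gaurav's exact code:

<!-- language: lang-java -->

import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Collapse the keyed window results into a single non-keyed stream,
// windowed the same way, so each firing emits exactly one record.
DataStream<String> perWindow = aggregatedStream
        .windowAll(SlidingEventTimeWindows.of(Time.seconds(20), Time.seconds(5)))
        .process(new ProcessAllWindowFunction<Tuple4<String, Integer, Long, String>, String, TimeWindow>() {
            @Override
            public void process(Context context,
                                Iterable<Tuple4<String, Integer, Long, String>> elements,
                                Collector<String> out) {
                // Concatenate all records of this window firing into one string.
                StringBuilder sb = new StringBuilder();
                for (Tuple4<String, Integer, Long, String> element : elements) {
                    sb.append(element).append(System.lineSeparator());
                }
                out.collect(sb.toString());
            }
        });

// The sink's invoke() is now called once per window firing,
// so each firing produces exactly one file.
perWindow.addSink(new MyCustomSink<>("C:\\Users\\outfile-")).setParallelism(1);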