
I have a simple program using the Flink CEP library to detect multiple failed logins from a file of log records. My application uses event time, and I am doing a keyBy on the logged-in 'user'.

The program works fine when I set the StreamExecutionEnvironment parallelism to 1. It fails to detect the pattern when parallelism is anything else, and I am unable to understand why.

I can see that all records related to a particular user are going to the same thread, so why the issue? I also see that the records are often not in event-time order (not sure if that is a problem), but I couldn't find anything in the API to sort the records by event time within a window.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setAutoWatermarkInterval(1000);
env.setParallelism(1); //tried with 1 & 4
.....    
DataStream<LogEvent> inputLogEventStream = env
    .readFile(format, FILEPATH, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000)
    .map(new MapToLogEvents())              
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<LogEvent>(Time.seconds(0)) {
            public long extractTimestamp(LogEvent element) {
                    return element.getTimeLong();
            }
    })
    .keyBy(new KeySelector<LogEvent, String>() {
            public String getKey(LogEvent le) throws Exception {
                    return le.getUser();
            }
    });
inputLogEventStream.print();

Pattern<LogEvent, ?> mflPattern = Pattern.<LogEvent> begin("mfl")
    .subtype(LogEvent.class).where(
            new SimpleCondition<LogEvent>() {
                    public boolean filter(LogEvent logEvent) {
                            return logEvent.getResult().equalsIgnoreCase("failed");
                    }
    })
    .timesOrMore(3).within(Time.seconds(60));

PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream, mflPattern);

DataStream<Threat> outputMflStream = mflPatternStream.select(
    new PatternSelectFunction<LogEvent, Threat>() {
            public Threat select(Map<String, List<LogEvent>> logEventsMap) {
                    return new Threat("MULTIPLE FAILED LOGINS detected!");
            }
    });
outputMflStream.print();

Also reproduced below are print outputs when:

parallelism = 1 (It detected pattern successfully)

04/03/2018 12:03:53 Source: Custom File Source(1/1) switched to RUNNING 
04/03/2018 12:03:53 SelectCepOperator -> Sink: Unnamed(1/1) switched to RUNNING 
04/03/2018 12:03:53 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(1/1) switched to RUNNING 
04/03/2018 12:03:53 Sink: Unnamed(1/1) switched to RUNNING 
LogEvent [recordType=base18, eventCategory=login, user=paul, machine=laptop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:08Z, timeLong=1522103408000]
LogEvent [recordType=base19, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:03Z, timeLong=1522103403000]
LogEvent [recordType=base20, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:05Z, timeLong=1522103405000]
LogEvent [recordType=base21, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:06Z, timeLong=1522103406000]
**THREAT** ==> MULTIPLE FAILED LOGINS detected!

parallelism = 4 (It failed to detect pattern)

04/03/2018 12:05:33 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(3/4) switched to RUNNING 
04/03/2018 12:05:33 Split Reader: Custom File Source -> Map -> Timestamps/Watermarks(2/4) switched to RUNNING 
04/03/2018 12:05:33 Sink: Unnamed(2/4) switched to RUNNING 
04/03/2018 12:05:33 SelectCepOperator -> Sink: Unnamed(2/4) switched to RUNNING 
04/03/2018 12:05:33 Sink: Unnamed(3/4) switched to RUNNING 
04/03/2018 12:05:33 SelectCepOperator -> Sink: Unnamed(3/4) switched to RUNNING 
2> LogEvent [recordType=base18, eventCategory=login, user=paul, machine=laptop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:08Z, timeLong=1522103408000]
3> LogEvent [recordType=base21, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:06Z, timeLong=1522103406000]
3> LogEvent [recordType=base20, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:05Z, timeLong=1522103405000]
3> LogEvent [recordType=base19, eventCategory=login, user=deb, machine=desktop1, result=failed, eventCount=1, dataBytes=100, time=2018-03-26T22:30:03Z, timeLong=1522103403000]
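A side note on the reversed ordering printed by subtask 3 above: with BoundedOutOfOrdernessTimestampExtractor(Time.seconds(0)), the watermark trails the highest timestamp seen by 0 ms, so any record that arrives after a later-timestamped one is already behind the watermark and treated as late. A rough plain-Java sketch of that bookkeeping (illustrative only; these are not Flink's actual classes):

```java
public class LatenessSketch {
    // BoundedOutOfOrderness-style watermark: highest timestamp seen minus the allowed lag.
    static long watermark(long maxTimestampSeen, long maxOutOfOrdernessMs) {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        // Arrival order printed by subtask 3: 22:30:06, then 22:30:05, then 22:30:03
        long[] arrivals = {1522103406000L, 1522103405000L, 1522103403000L};
        long bound = 0; // Time.seconds(0), as in the question
        long maxSeen = Long.MIN_VALUE;
        for (long ts : arrivals) {
            // An element is late if its timestamp is at or behind the current watermark.
            boolean late = maxSeen != Long.MIN_VALUE && ts <= watermark(maxSeen, bound);
            maxSeen = Math.max(maxSeen, ts);
            System.out.println(ts + (late ? " LATE (dropped by CEP)" : " on time"));
        }
    }
}
```

With bound = 0, the 22:30:05 and 22:30:03 events are both late the moment 22:30:06 has been seen; with a bound of a few seconds they would still be accepted.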
  • Another point: If the watermark is set to something like 5 secs, then even with parallelism = 1, nothing is detected. But if watermark set to something like 2 secs or 0 secs, then at least parallelism = 1 works – Debajyoti Pathak Apr 03 '18 at 07:25
  • If you increase parallelism make sure that for each "partition" there is some data being read. If some parallel instance of the source does not read data, then resulting watermark(which is min of all inputs) does not progress. – Dawid Wysakowicz Apr 03 '18 at 08:01
  • @David - thanks. Taking cue from what you said, I am able to verify that for parallelism=2, it is working consistently. Each of the two threads is reading some data. For parallelism of 3 or 4, that was not always true. However, another issue is that even if I put the watermark = 1 sec, it does not work always. But always works for watermark=0. I guess some of these issues may be due to my very small data set, but I am still looking to understand why that is so. At least for testing, my data set will be small. – Debajyoti Pathak Apr 03 '18 at 09:23
  • 1
    Watermarking with 0 delay is rather pathological. That makes any out-of-order event late, and CEP drops all late events. – David Anderson Apr 03 '18 at 18:32
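Dawid's point about idle partitions can be sketched in plain Java (illustrative only; Flink does this bookkeeping internally): an operator's current watermark is the minimum across its input channels, and a channel whose source subtask never read any data stays at Long.MIN_VALUE, so the CEP operator's event-time clock never advances and the pattern never fires:

```java
import java.util.Arrays;

public class WatermarkMin {
    // Combined watermark across parallel inputs = minimum of the per-channel watermarks.
    static long combinedWatermark(long[] channelWatermarks) {
        return Arrays.stream(channelWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        long t = 1522103408000L; // 2018-03-26T22:30:08Z, from the log above
        // parallelism = 4, but one subtask got no file splits and emitted no watermark
        long[] fourSubtasks = {t, t, Long.MIN_VALUE, t};
        System.out.println(combinedWatermark(fourSubtasks)); // stuck at Long.MIN_VALUE
        // parallelism = 2, both subtasks received data
        long[] twoSubtasks = {t, t - 1000};
        System.out.println(combinedWatermark(twoSubtasks)); // advances to t - 1000
    }
}
```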

1 Answer


I think what is happening here is that different partitions are getting these events; when working with CEP, using .keyBy() is very important.

Your code

PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream, mflPattern);

I believe it should be

PatternStream<LogEvent> mflPatternStream = CEP.pattern(inputLogEventStream.keyBy("eventCategory","user"), mflPattern);

You might want to look at https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams
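For intuition, the effect of keyBy can be sketched in plain Java (illustrative only: Flink actually routes keys via murmur-hashed key groups, not hashCode modulo parallelism): every event with the same key is routed to the same parallel CEP instance, which is what lets the pattern see all of one user's logins together:

```java
public class KeyRouting {
    // Simplified stand-in for keyBy routing: same key -> same subtask, always.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int p = 4;
        // All of deb's failed logins land on one instance; paul may land elsewhere.
        System.out.println("deb  -> subtask " + subtaskFor("deb", p));
        System.out.println("deb  -> subtask " + subtaskFor("deb", p)); // same as above
        System.out.println("paul -> subtask " + subtaskFor("paul", p));
    }
}
```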

Mongo