I have implemented a MapFunction for my Apache Flink flow. It parses incoming elements and converts them to another format, but sometimes an error can occur (e.g. the incoming data is not valid).

I see two possible ways to handle it:

  • Ignore invalid elements, but it seems I can't ignore errors, because a MapFunction must produce an outgoing element for every incoming element.
  • Split incoming elements into valid and invalid ones, but it seems I would need a different function for this.

So, I have two questions:

  1. How do I handle errors correctly in my MapFunction?
  2. How do I implement such transformation functions correctly?
Maxim

2 Answers

You could use a FlatMapFunction instead of a MapFunction. This would allow you to only emit an element if it is valid. The following shows an example implementation:

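import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

input.flatMap(new FlatMapFunction<String, Long>() {
    @Override
    public void flatMap(String value, Collector<Long> collector) throws Exception {
        try {
            // Emit the parsed number only when parsing succeeds.
            collector.collect(Long.valueOf(value));
        } catch (NumberFormatException e) {
            // Ignore invalid data: nothing is emitted for this element.
        }
    }
});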
Till Rohrmann
  • But ignoring the invalid data is not an option for a lot of use cases. What I want to do instead is forward the message to a different data sink for further examination. Does anybody have a good idea how to accomplish that? – Abiy Legesse Hailemichael Sep 12 '16 at 17:42
  • You could introduce a wrapper type which can contain valid and invalid values. Then you could use the `split` + `select` function to split the stream into a failure stream and a correct value stream which you can write to a different sink. – Till Rohrmann Sep 15 '16 at 08:38
  • Till's suggestion is great, and sounds like the basis for a general improvement where any operator could have an `exceptionally` side-output. – Eron Wright Jul 19 '17 at 21:13
  • @TillRohrmann - could you please expand on your split + select idea? At first I thought it would work by using the `keyBy()` function to split the stream into different partitions (one of them being for "bad data"). But that would mean the downstream operators would need to know about it too. I think any solution that isolates the bad data as early as possible in the graph would be a good thing. Thanks! – victtim Mar 06 '18 at 14:39
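
The wrapper type suggested in the comments above could look roughly like the following. This is only a sketch in plain Java with no Flink dependency; the class name `ParseResult` and all of its methods are hypothetical names chosen for illustration, and parsing to `Long` is assumed only because the answer's example does so.

```java
// Hypothetical wrapper type: holds either a parsed value or the raw input that failed.
final class ParseResult {
    private final Long value;       // non-null only when parsing succeeded
    private final String rawInput;  // original element, kept so it can go to an error sink
    private final String error;     // failure description, null when parsing succeeded

    private ParseResult(Long value, String rawInput, String error) {
        this.value = value;
        this.rawInput = rawInput;
        this.error = error;
    }

    // Never throws: a parse failure is captured in the returned object instead.
    static ParseResult parse(String rawInput) {
        try {
            return new ParseResult(Long.valueOf(rawInput), rawInput, null);
        } catch (NumberFormatException e) {
            return new ParseResult(null, rawInput, e.getMessage());
        }
    }

    boolean isValid() { return value != null; }
    Long getValue() { return value; }
    String getRawInput() { return rawInput; }
    String getError() { return error; }
}
```

A MapFunction could then return `ParseResult.parse(element)` for every incoming element, and the stream could be separated downstream on `isValid()`, for example with two `filter` calls, so that valid and invalid elements reach different sinks.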

This is to build on @Till Rohrmann's idea above. Adding this as an answer instead of a comment for better formatting.

I think one way to implement "split + select" could be to use a ProcessFunction with a SideOutput. My graph would look something like this:

Source --> ValidateProcessFunction ---good data--> UDF--->SinkToOutput
                                    \
                                     \---bad data----->SinkToErrorChannel

Would this work? Is there a better way?
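
A minimal sketch of the validation step in the graph above, assuming Flink's DataStream API with `ProcessFunction` and `OutputTag`. The class name `ValidateProcessFunction` comes from the diagram; the field name `BAD_DATA`, the tag id `"bad-data"`, and the String-to-Long parsing are assumptions made for illustration.

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ValidateProcessFunction extends ProcessFunction<String, Long> {

    // Hypothetical tag for rejected elements. It is created as an anonymous
    // subclass so that the element type (String) is retained.
    public static final OutputTag<String> BAD_DATA = new OutputTag<String>("bad-data") {};

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) {
        try {
            // Good data continues on the main output.
            out.collect(Long.valueOf(value));
        } catch (NumberFormatException e) {
            // Bad data is routed to the side output instead of being dropped.
            ctx.output(BAD_DATA, value);
        }
    }
}
```

Applying it with `source.process(new ValidateProcessFunction())` yields the main stream of valid values, and calling `getSideOutput(ValidateProcessFunction.BAD_DATA)` on that result yields the stream of rejected elements, which can be written to the error sink.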

victtim