
I am using ProcessFunction in a PyFlink (1.15.0) job. One of the use cases is to filter out bad input and redirect it to a different Kafka topic.

In Java, we use OutputTag to redirect those inputs to another stream and then to a different sink. In PyFlink 1.15.0, OutputTag is not supported yet; I can see it being introduced in the 1.16.0-SNAPSHOT version.

Is there any other way to do this in 1.15.0? I cannot move to the 1.16.0-SNAPSHOT version.

Sample code:

from pyflink.common import Row, Types
from pyflink.common.serialization import (JsonRowDeserializationSchema,
                                          JsonRowSerializationSchema)
from pyflink.datastream import (CheckpointingMode, ExternalizedCheckpointCleanup,
                                ProcessFunction, StreamExecutionEnvironment)
from pyflink.datastream.checkpoint_storage import FileSystemCheckpointStorage
from pyflink.datastream.connectors import FlinkKafkaConsumer, FlinkKafkaProducer


class MyProcessFunction(ProcessFunction):

    def process_element(self, value, ctx: 'ProcessFunction.Context'):
        # The declared output type is ROW_NAMED(["id"], [STRING]), so yield a Row
        yield Row(id=value[0])

def run_job():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(5 * 60 * 1000)
    env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
    env.get_checkpoint_config().set_min_pause_between_checkpoints(30 * 1000)
    env.get_checkpoint_config().set_checkpoint_timeout(10 * 1000)
    env.get_checkpoint_config().set_tolerable_checkpoint_failure_number(2)
    env.get_checkpoint_config().set_externalized_checkpoint_cleanup(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.get_checkpoint_config().set_max_concurrent_checkpoints(1)
    env.get_checkpoint_config().enable_unaligned_checkpoints()
    env.get_checkpoint_config().set_checkpoint_storage(FileSystemCheckpointStorage("file:///flink-checkpoints"))
    env.add_jars("file:///jar_files//flink-sql-connector-kafka-1.15.0.jar")
    deserialization_schema = JsonRowDeserializationSchema.builder().type_info(
        type_info=Types.ROW_NAMED(["id", "eventTime", "timeoutInMillis"],
                                  [Types.STRING(), Types.LONG(), Types.LONG()])).build()

    serialization_schema = JsonRowSerializationSchema.builder().with_type_info(
        type_info=Types.ROW_NAMED(["id"], [Types.STRING()])).build()

    source = FlinkKafkaConsumer(topics="test-input",
                                deserialization_schema=deserialization_schema,
                                properties={'bootstrap.servers': 'localhost:9092',
                                            'group.id': 'job1',
                                            'auto.offset.reset': 'earliest'})

    sink = FlinkKafkaProducer(topic="test-output",
                              serialization_schema=serialization_schema,
                              producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

    ds1 = env.add_source(source, "kafka-source")
    ds2 = ds1.process(MyProcessFunction(), Types.ROW_NAMED(["id"], [Types.STRING()])).name(
        "task1").disable_chaining()

    ds2.add_sink(sink).name("kafka-sink")

    env.execute("job1")


if __name__ == '__main__':
    run_job()

How do I introduce side-output (or similar) functionality in this code so that any exceptional value in the process_element method of MyProcessFunction can be redirected to another Kafka topic?
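Until OutputTag lands in 1.16, one common workaround is to tag each record inside process_element and then split the tagged stream with two filter operators, routing each branch to its own Kafka sink. Below is a minimal sketch of that idea: the routing decision lives in a plain function so it can be tested without a Flink cluster, and the tag names, the validation rule, and the build_split_job wiring are illustrative assumptions, not part of the code above.

```python
# Side-output workaround for PyFlink 1.15: tag every record, then filter-split.
TAG_OK, TAG_ERR = "ok", "err"


def classify(value):
    """Routing decision as (tag, payload). Kept as plain Python so it is
    unit-testable; the validation rule (non-empty id in value[0]) is an
    illustrative assumption."""
    try:
        if not value or not value[0]:
            raise ValueError("missing id")
        return TAG_OK, str(value[0])
    except Exception as exc:  # any failure goes to the error branch
        return TAG_ERR, repr(exc)


def build_split_job():
    # PyFlink wiring, shown for shape only; it needs a running 1.15 setup.
    # Imports are local so the routing logic above stays cluster-free.
    from pyflink.common import Row, Types
    from pyflink.datastream import ProcessFunction, StreamExecutionEnvironment

    class TaggingProcessFunction(ProcessFunction):
        def process_element(self, value, ctx):
            tag, payload = classify(value)
            yield Row(tag=tag, payload=payload)

    env = StreamExecutionEnvironment.get_execution_environment()
    tagged_type = Types.ROW_NAMED(["tag", "payload"],
                                  [Types.STRING(), Types.STRING()])
    tagged = env.from_collection([("a", 1, 2), ("", 3, 4)]) \
                .process(TaggingProcessFunction(), tagged_type)
    # Two filters emulate the side output: each branch gets its own sink
    # (e.g. one FlinkKafkaProducer per topic instead of print()).
    good = tagged.filter(lambda r: r[0] == TAG_OK)   # -> main Kafka topic
    bad = tagged.filter(lambda r: r[0] == TAG_ERR)   # -> error Kafka topic
    good.print()
    bad.print()
    env.execute("split-job")
```

The cost versus a true side output is that the stream is scanned once per filter and every record carries the extra tag field, but it keeps the job on 1.15 without any snapshot dependency.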

    This question was asked and answered on the flink user mailing list. See https://lists.apache.org/thread/28s49bp47nt5zjj4sdlkooym7z39tsq8 – David Anderson May 23 '22 at 14:05
