
Our use case: we want to use Flink streaming for a de-duplication job that reads its data from a source (Kafka topic) and writes unique records into an HDFS file sink. The Kafka topic may contain duplicate data, which can be identified using the composite key (adserver_id, unix_timestamp of the record).

So I decided to use a Flink keyed stream with keyed state to achieve de-duplication.

val messageStream: DataStream[DCNRecord] = env.addSource(flinkKafkaConsumer)

messageStream
  .map { record =>
    // build the composite de-duplication key
    val key = record.adserver_id.get + record.event_timestamp.get
    (key, record)
  }
  .keyBy(_._1)
  .flatMap(new DedupDCNRecord())
  .map(_.toString)
  .addSink(sink)

// execute the stream
env.execute(applicationName)

Here is the code for de-duplication using Flink's value state.

class DedupDCNRecord extends RichFlatMapFunction[(String, DCNRecord), DCNRecord] {
  // keyed state holding the key once it has been seen
  private var operatorState: ValueState[String] = _

  override def open(configuration: Configuration): Unit = {
    operatorState = getRuntimeContext.getState(
      DedupDCNRecord.descriptor
    )
  }

  @throws[Exception]
  override def flatMap(value: (String, DCNRecord), out: Collector[DCNRecord]): Unit = {
    if (operatorState.value == null) { // we haven't seen this key yet
      out.collect(value._2)
      // remember the key so that we don't emit elements with this key again
      operatorState.update(value._1)
    }
  }
}

object DedupDCNRecord {
  // state descriptor referenced above (assumed here; not shown in the original snippet)
  val descriptor = new ValueStateDescriptor[String]("dedup-key", classOf[String])
}

This approach works fine as long as the streaming job is running: the value state holds the unique keys seen so far and the de-duplication works. But as soon as I cancel the job, Flink loses this state (it only keeps the unique keys for the current run) and lets records through that were already processed in a previous run of the job. Is there a way to make Flink maintain its value state (unique keys seen so far) across runs? Appreciate your help.

  • I assume you have checkpointing enabled, and you've got a group.id set for your Kafka consumer, so that Flink saves the consumer offset (per Kafka topic partition) as part of the checkpoint state. – kkrugler Nov 06 '22 at 17:53
  • Separately, you don't need to extract the key into a Tuple2 field - assuming you have some `.getKey()` method in your record, just use `.keyBy(r -> r.getKey())` (for Java). In your `DedupDCNRecord` function you don't need to save the key in state; your state is keyed by this value, so just use something like `ValueState` (a sketch of this appears after these comments). – kkrugler Nov 06 '22 at 17:55
  • @kkrugler yes, I have checkpointing enabled in my job through embedded RocksDB. However, it seems checkpointing only helps in restoring operator state. Keyed state per task is maintained separately and can only be restored between different job runs from a savepoint (h/t David Anderson). – S Mishra Nov 07 '22 at 02:17
  • You can use retained checkpoints for restarts -- you don't have to use savepoints. Both kinds of snapshots contain both operator and keyed state. See https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/state/checkpoints_vs_savepoints/ for more info. – David Anderson Nov 07 '22 at 11:43
  • @DavidAnderson I tried to use a retained checkpoint for maintaining the keyed state, but it doesn't seem to work. I also don't see a _metadata dir in the retained checkpoint, which I saw in the savepoint. Am I missing something? ```val checkpointConfig = env.getCheckpointConfig env.setStateBackend(new EmbeddedRocksDBStateBackend()) checkpointConfig.setCheckpointStorage(checkPointDir) checkpointConfig.setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) checkpointConfig.setCheckpointStorage(new FileSystemCheckpointStorage(checkPointDir))``` – S Mishra Nov 15 '22 at 03:10
  • It doesn't make sense to call both `checkpointConfig.setCheckpointStorage(checkPointDir)` and `checkpointConfig.setCheckpointStorage(new FileSystemCheckpointStorage(checkPointDir))` -- and I'm not sure what effect that has. – David Anderson Nov 15 '22 at 09:55
  • What makes you think it isn't working? – David Anderson Nov 15 '22 at 09:55
  • ty, I got confused here https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/state_backends/#rocksdbstatebackend. The reason I say it's not working is that when I resume my job from the retained checkpoint (using -s), similar to a savepoint, my keyed state didn't prevent duplicates from being written into the sink. I forgot to mention I had also enabled incremental checkpointing in the flink-conf.yaml file. Also, is it expected not to have a _metadata dir in the checkpoint, which I saw in the savepoint? – S Mishra Nov 16 '22 at 02:25
  • I think I figured it out; the directory structure differs between checkpoints and savepoints. In a checkpoint, the _metadata dir is nested under the checkpoint dir, e.g. ```/user-defined-checkpoint-dir/{job-id}/chk-1/_metadata```, whereas in a savepoint it's at the job-level dir, e.g. /user-defined-checkpoint-dir/{job-id}/_metadata. I am able to resume my keyed state from the retained checkpoint now. Thanks @DavidAnderson for your help. – S Mishra Nov 16 '22 at 22:04
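
Putting the pieces from this comment thread together, a minimal sketch of the retained-checkpoint configuration might look like the following. It assumes Flink 1.13+ APIs and a single call to setCheckpointStorage; the checkpoint directory and interval are placeholders.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// RocksDB state backend; the constructor flag enables incremental checkpoints
env.setStateBackend(new EmbeddedRocksDBStateBackend(true))

// take a checkpoint every 60 seconds (placeholder interval)
env.enableCheckpointing(60000)

val checkpointConfig = env.getCheckpointConfig
// one checkpoint storage location is enough; an hdfs:// path resolves to filesystem storage
checkpointConfig.setCheckpointStorage("hdfs:///path/to/checkpoints")
// keep the latest checkpoint when the job is cancelled, so it can be used for a restart
checkpointConfig.setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)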
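
And here is a minimal sketch of the simplification kkrugler suggests above: key directly on the record and store only a flag, since keyed state is already scoped per key. DCNRecord and its fields are taken from the question; the class and descriptor names are made up for illustration.

class DedupByKey extends RichFlatMapFunction[DCNRecord, DCNRecord] {
  // a single flag per key; keyed state is scoped to the current key automatically
  private var seen: ValueState[java.lang.Boolean] = _

  override def open(configuration: Configuration): Unit = {
    seen = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Boolean]("seen", classOf[java.lang.Boolean]))
  }

  override def flatMap(record: DCNRecord, out: Collector[DCNRecord]): Unit = {
    if (seen.value == null) { // first record for this key
      out.collect(record)
      seen.update(true)
    }
  }
}

messageStream
  .keyBy(r => r.adserver_id.get + r.event_timestamp.get) // key extracted in keyBy, no Tuple2 needed
  .flatMap(new DedupByKey())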

1 Answer


This requires that you capture a snapshot of the job's state before shutting it down, and then restart from that snapshot:

  1. Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
  2. Relaunch, using the savepoint as the starting point.

For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground. The section on Observing Failure & Recovery is also relevant here.
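
With the Flink CLI, the two steps look roughly like this (the savepoint directory, job id, and jar name are placeholders):

# 1. stop the job, taking a savepoint of its state
bin/flink stop --savepointPath hdfs:///path/to/savepoints <job-id>

# 2. relaunch the job, resuming from that savepoint
bin/flink run -s hdfs:///path/to/savepoints/savepoint-<id> dedup-job.jar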

David Anderson