5

I implemented a flink stream with a BroadcastProcessFunction. From the processBroadcastElement I get my model and I apply it on my event in processElement.

I don't find a way to unit test my stream as I don't find a solution to ensure the model is dispatched prior to the first event. I would say there are two ways for achieving this:
1. Find a solution to have the model pushed in the stream first
2. Have the broadcast state filled with the model prio to the execution of the stream so that it is restored

I may have missed something, but I have not found an simple way to do this.

Here is a simple unit test with my issue:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import org.scalatest.Matchers._
import org.scalatest.{BeforeAndAfter, FunSuite}

import scala.collection.mutable


class BroadCastProcessor extends BroadcastProcessFunction[Int, (Int, String), String] {

  import BroadCastProcessor._

  override def processElement(value: Int,
                              ctx: BroadcastProcessFunction[Int, (Int, String), String]#ReadOnlyContext,
                              out: Collector[String]): Unit = {
    val broadcastState = ctx.getBroadcastState(broadcastStateDescriptor)

    if (broadcastState.contains(value)) {
      out.collect(broadcastState.get(value))
    }
  }

  override def processBroadcastElement(value: (Int, String),
                                       ctx: BroadcastProcessFunction[Int, (Int, String), String]#Context,
                                       out: Collector[String]): Unit = {
    ctx.getBroadcastState(broadcastStateDescriptor).put(value._1, value._2)
  }
}

object BroadCastProcessor {
  val broadcastStateDescriptor: MapStateDescriptor[Int, String] = new MapStateDescriptor[Int, String]("int_to_string", classOf[Int], classOf[String])
}

class CollectSink extends SinkFunction[String] {

  import CollectSink._

  override def invoke(value: String): Unit = {
    values += value
  }
}

object CollectSink { // must be static
  val values: mutable.MutableList[String] = mutable.MutableList[String]()
}

class BroadCastProcessTest extends FunSuite with BeforeAndAfter {

  before {
    CollectSink.values.clear()
  }

  test("add_elem_to_broadcast_and_process_should_apply_broadcast_rule") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val dataToProcessStream = env.fromElements(1)

    val ruleToBroadcastStream = env.fromElements(1 -> "1", 2 -> "2", 3 -> "3")

    val broadcastStream = ruleToBroadcastStream.broadcast(BroadCastProcessor.broadcastStateDescriptor)

    dataToProcessStream
      .connect(broadcastStream)
      .process(new BroadCastProcessor)
      .addSink(new CollectSink())

    // execute
    env.execute()

    CollectSink.values should contain("1")
  }
}

Update thanks to David Anderson
I went for the buffer solution. I defined a process function for the synchronization:

class SynchronizeModelAndEvent(modelNumberToWaitFor: Int) extends CoProcessFunction[Int, (Int, String), Int] {
  val eventBuffer: mutable.MutableList[Int] = mutable.MutableList[Int]()
  var modelEventsNumber = 0

  override def processElement1(value: Int, ctx: CoProcessFunction[Int, (Int, String), Int]#Context, out: Collector[Int]): Unit = {
    if (modelEventsNumber < modelNumberToWaitFor) {
      eventBuffer += value
      return
    }
    out.collect(value)
  }

  override def processElement2(value: (Int, String), ctx: CoProcessFunction[Int, (Int, String), Int]#Context, out: Collector[Int]): Unit = {
    modelEventsNumber += 1

    if (modelEventsNumber >= modelNumberToWaitFor) {
      eventBuffer.foreach(event => out.collect(event))
    }
  }
}

And so I need to add it to my stream:

dataToProcessStream
  .connect(ruleToBroadcastStream)
  .process(new SynchronizeModelAndEvent(3))
  .connect(broadcastStream)
  .process(new BroadCastProcessor)
  .addSink(new CollectSink())

Thanks

Matthieu
  • 85
  • 4
  • Thanks for asking the question and sharing the buffer solution. – Mohammad Reza Esmaeilzadeh Sep 26 '21 at 05:56
  • I think the solution is not so straight forward as it does not guarantee that the broadcast event has been processed downstream. It just guarantees that the SynchronizeModelAndEvent has received the event not the actual BroadCastProcessor – emilio Feb 02 '23 at 15:01

2 Answers2

3

There isn't an easy way to do this. You could have processElement buffer all of its input until the model has been received by processBroadcastElement. Or run the job once with no event traffic and take a savepoint once the model has been broadcast. Then restore that savepoint into the same job, but with its event input connected.

By the way, the capability you are looking for is often referred to as "side inputs" in the Flink community.

David Anderson
  • 39,434
  • 4
  • 33
  • 60
0

Thanks to David Anderson and Matthieu I wrote this generic CoFlatMap function that makes the requested delay on the event stream:

import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector

import scala.collection.mutable

class SynchronizeEventsWithRules[A,B](rulesToWait: Int) extends CoProcessFunction[A, B, A] {
  val eventBuffer: mutable.MutableList[A] = mutable.MutableList[A]()
  var processedRules = 0
  override def processElement1(value: A, ctx: CoProcessFunction[A, B, A]#Context, out: Collector[A]): Unit = {
    if (processedRules < rulesToWait) {
      println("1 item buffered")
      println(rulesToWait+"--"+processedRules)
      eventBuffer += value
      return
    }
    eventBuffer.clear()
    println("send input to output without buffering:")
    out.collect(value)
  }

  override def processElement2(value: B, ctx: CoProcessFunction[A, B, A]#Context, out: Collector[A]): Unit = {
    processedRules += 1
    println("1 rule processed processedRules:"+processedRules)
    if (processedRules >= rulesToWait && eventBuffer.length>0) {
      println("send buffered data to output")
      eventBuffer.foreach(event => out.collect(event))
      eventBuffer.clear()
    }
  }
}

but unfortunately, it does not help at all in my case, because the subject under the test was a KeyedBroadcastProcessFunction that makes the delay on event data irrelevant because of that I tried to apply a flatmap that makes the rule stream n times larger, n was the number of CPUs. so I will sure that the resulting event stream will be always sync with the rule stream and will be arrived after that but it does not help either.

after all, I came to this simple solution that of course it is not deterministic but because of the nature of parallelism and concurrency the problem itself is not deterministic either. If we set the delayMilis big enough (>100) the result will be deterministic:

val delayMilis = 100
val synchronizedInput = inputEventStream.map(x=>{
      Thread.sleep(delayMilis)
      x
    }).keyBy(_.someKey)

you can also change the mapping function to this to apply the delay only on the first element:

package util

import org.apache.flink.api.common.functions.MapFunction

class DelayEvents[T](delayMilis: Int) extends MapFunction[T,T] {
  var delayed = false
  override def map(value: T): T = {
        if (!delayed) {
          delayed=true
          Thread.sleep(delayMilis)
        }
        value
  }
} 
val delayMilis = 100
val synchronizedInput = inputEventStream.map(new DelayEvents(100)).keyBy(_.someKey)