Using Scio version 0.4.7, I have a streaming job listening to a Pub/Sub topic. I'm doing event-time processing here, with a 'timestamp' attribute present on the message properties in RFC3339 format.

import com.spotify.scio.values.{SCollection, WindowOptions}
import org.apache.beam.sdk.transforms.windowing.{AfterWatermark, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode.DISCARDING_FIRED_PANES
import org.joda.time.Duration

val rtEvents: SCollection[RTEvent] = sc.pubsubTopic(args("topic"), timestampAttribute = "timestamp").map(jsonToObject)
val windowedEvents = rtEvents.withFixedWindows(
  Duration.standardMinutes(1L),
  options = WindowOptions(
    trigger = Repeatedly.forever(AfterWatermark.pastEndOfWindow()), // fire once the watermark passes the window end
    accumulationMode = DISCARDING_FIRED_PANES,                      // discard pane contents after firing
    allowedLateness = Duration.standardSeconds(1L)                  // accept data up to 1s late
  )
)
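
For context, the upstream producer attaches event time as a message attribute, roughly like this (a minimal sketch using the google-cloud-pubsub Java client from Scala; the project/topic names and payload fields are placeholders, not the actual producer code):

import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.PubsubMessage
import java.time.Instant

// Hypothetical producer: the 'timestamp' attribute carries event time in RFC3339.
val publisher = Publisher.newBuilder("projects/my-project/topics/my-topic").build()
val message = PubsubMessage.newBuilder()
  .setData(ByteString.copyFromUtf8("""{"id": "e1", "properties": {"key": "k1", "category": "c1"}}"""))
  .putAttributes("timestamp", Instant.now().toString) // e.g. "2018-02-11T20:56:03.123Z"
  .build()
publisher.publish(message)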

I use windowedEvents for further aggregation and calculations in the pipeline:

doSomeAggregation(windowedEvents)

def doSomeAggregation(events: SCollection[RTEvent]): SCollection[(String, Map[String, Int])] =
  events
    .map(e => (e.properties.key, (e.properties.category, e.id)))
    .groupByKey
    .map { case (key, tuples: Iterable[(String, String)]) =>
      // distinct ids per category for this key; .toMap forces the lazy
      // mapValues view into a concrete (serializable) Map
      val countPerCategory: Map[String, Int] =
        tuples.groupBy(_._1).mapValues(_.toList.distinct.size).toMap
      // some other http post and logging here
      (key, countPerCategory)
    }

sc.close().waitUntilFinish()
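
As an aside (unrelated to the question), groupByKey buffers every event for a key in memory before counting; the same counts can be produced with combining primitives so the runner aggregates incrementally. A rough sketch against the Scio API, untested, and assuming Algebird's Map semigroup resolves implicitly for sumByKey:

def doSomeAggregationCombined(events: SCollection[RTEvent]): SCollection[(String, Map[String, Int])] =
  events
    .map(e => ((e.properties.key, e.properties.category), e.id))
    .distinct                // de-duplicate ids within each window pane
    .countByKey              // ((key, category), distinct id count)
    .map { case ((key, category), n) => (key, Map(category -> n.toInt)) }
    .sumByKey                // merge the per-category maps for each key

The HTTP post and logging side effects would then live in a separate step downstream.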

If I run the job with the following autoscaling parameters on Google Dataflow:

--workerMachineType=n1-standard-8 --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=4

the job runs and the fixed windows fire correctly as long as only one worker is running. As soon as the job autoscales to more than one worker, the fixed windows stop firing, the initial Pub/Sub step's system lag and wall time keep growing, and the data watermark does not move forward.

Is there something wrong with my trigger setup? Has anyone else experienced this on the Dataflow runner or other runners? Any help is greatly appreciated. I'm inclined to drop Scio and revert back to the Apache Beam Java SDK if I can't solve this.

  • I don't see anything wrong with the application. It is most likely not a Scio issue. Do you have a job_id we can take a look at? Btw, did the pipeline work with multiple workers? (You can start it with two workers with the --numWorkers=2 option.) – Raghu Angadi Feb 13 '18 at 21:05
  • Hey Raghu, thanks for the reply. Here's the job_id where I started with a single worker, and the problem occurred after the job upscaled to 4 workers: 2018-02-11_20_56_03-1568786864154833556. I did start the job with 2 workers, and it did not fire any windows at all – job id: 2018-02-11_20_43_27-4213713692706340214. I guess you work for Google; if you'd like a formal ticket to debug this, let me know. – ASingh Feb 14 '18 at 02:43
  • Filing a formal ticket will be good. Please go ahead and mention this post. Meanwhile I will request permission to look at the job status. – Raghu Angadi Feb 14 '18 at 23:39
  • This job ran only for 10 minutes. If you are using a custom timestamp, it could take some time for Dataflow to scan the timestamps of the messages in Pub/Sub and advance the watermark. Did you have a lot of messages to process at the start? – Raghu Angadi Feb 15 '18 at 23:33
  • If you're talking about the job 4213713692706340214, yes, it ran only for 10-odd minutes before I killed it, since I expected the window to fire in the first minute itself. I have a continuous stream of production data on the Pub/Sub topic, so yes, I did have a lot of messages to process at the start. I can give you another job id if you'd like, where I start with a single worker and everything runs fine. – ASingh Feb 16 '18 at 03:51
  • @raghu I've created a case for this (case number: 15088253). I've simplified my code and started a new job (id: 2018-02-24_16_20_46-6460854547312001582) with a single worker. Windows fired correctly until the job autoscaled from 1 to 4 workers. I'm going to leave this job running for a day or two. – ASingh Feb 25 '18 at 01:00
  • @Amol, the simplest explanation might be that the workers are not able to talk to each other for the shuffle. I looked at the job you mentioned; it was stopped this morning at 7am PST. The symptoms pointed to a networking issue. That also explains why it didn't work with --numWorkers=2. Dataflow's default permissions enable all ports between the workers, I think; your admin might have changed it. I will find a link for the relevant documentation. – Raghu Angadi Feb 26 '18 at 21:45
  • You do seem to have a firewall rule allowing all TCP ports between Dataflow workers. Still not sure why they can't communicate. Any streaming job with more than one worker is likely affected by this issue. – Raghu Angadi Feb 26 '18 at 23:01
  • That makes sense – I was leaning towards that explanation, as my Beam SDK job also didn't work. I'll check with my internal SRE team to see if something was changed. – ASingh Feb 27 '18 at 02:06
  • @RaghuAngadi I have resolved the issue. You were right: the workers were not able to talk to each other. The firewall rule set up for Dataflow on our account was limited to our network; passing --network=our-network has resolved the issue. I appreciate your effort in resolving this. Thanks! – ASingh Feb 27 '18 at 16:32
  • Great. Could you summarize it as an answer? It will be useful for future reference; the comments are not easy to follow. So the network named 'default' didn't have this firewall rule? – Raghu Angadi Feb 27 '18 at 21:10

1 Answer

I managed to resolve the issue. In my setup the workers were unable to communicate with each other, and the job silently fails without any timeout errors (something Beam should probably propagate up as an error).

If you're using Dataflow as your runner, make sure the firewall rule defined for Dataflow on your project applies to the 'default' network.

If the Dataflow firewall rule is instead defined for a custom network, you will need to pass an additional runtime parameter to your job:

--workerMachineType=n1-standard-8 --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=4 --network='your-network'
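
For reference, the worker-to-worker firewall rule looks roughly like this (a sketch: the rule name is a placeholder, and tcp:12345-12346 are the ports Google's Dataflow documentation lists for streaming worker communication – verify against the current docs):

gcloud compute firewall-rules create allow-dataflow-internal \
    --network=your-network \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:12345-12346 \
    --source-tags=dataflow \
    --target-tags=dataflow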
