
Running a streaming Dataflow pipeline with a fairly advanced group-by using session windows, I run into problems after a couple of hours of running. The job scales up in workers, but later starts producing loads of logs like the following:

Processing lull for PT7500.005S in state process of ...

The transform that logs this message comes right after the group-by block and executes an async HTTP call (using scala.concurrent.{Await, Promise}) to an external service, roughly along the lines of the sketch below.
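
A simplified sketch of the pattern (`callService` is a hypothetical stand-in for the real HTTP client; the actual code runs inside a Scio transform):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for the real async HTTP client call.
def callService(payload: String): Future[String] =
  Future(s"response for $payload")

// Fire the call asynchronously, then block the worker thread until
// the promise completes.
def enrich(element: String): String = {
  val p = Promise[String]()
  callService(element).onComplete(p.tryComplete)
  // With no timeout, a promise that never completes blocks here forever.
  Await.result(p.future, Duration.Inf)
}
```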

Any ideas why this happens? Related to async, scaling or group by strategy?

  • Job ID: 2018-01-29_03_13_40-12789475517328084866
  • SDK: Apache Beam SDK for Java 2.2.0
  • Scio version: 0.4.7
    This might be related to calling out to the HTTP service asynchronously. I've experienced similar issues related to this. As a test, you can try calling the service synchronously. You won't get nearly as high throughput, but you may be able to determine if the issue is related to the async call. – Andrew Nguonly Jan 29 '18 at 16:17
  • Could it be that you are overloading the server that you're talking to via HTTP? – Pablo Jan 30 '18 at 01:16
  • @Andrew: I will surely try this. The reason for me to use async in the first place was both to get better throughput and to be able to use retry logic for HTTP server errors. Do you have any recommendations for a good substitute for this? – Brodin Jan 30 '18 at 09:10
  • @Pablo: Well, the throughput is pretty high, but that shouldn't be a problem since the service I talk to is auto-scaled to infinity and beyond. However, if I overloaded the service, why would Beam act this way? – Brodin Jan 30 '18 at 09:12
  • @Brodin, one thing that I experimented with was configuring the number of threads used for the `ExecutionContextExecutorService`. This allowed me to control the number of concurrent requests to the service. If the service became overloaded, I could turn down the number of threads (see the thread-pool sketch after these comments). Unfortunately, there's no good substitute for async calls to services. The alternative is to include the service logic as a transform (i.e. calling out directly to a database). I also experimented with implementing the Dataflow job in Node.js, which is built for async functionality. – Andrew Nguonly Jan 30 '18 at 13:18
  • @Brodin, to answer your second question, Dataflow may throttle the job (reduce input) if it determines that it is not achieving high output. If the service is returning errors from being overloaded, Dataflow will continually retry the request and may ultimately stop accepting input if it continues to receive error responses. Just to confirm, is the Dataflow job becoming throttled? – Andrew Nguonly Jan 30 '18 at 13:25
  • "Processing lull" is not an error - it's just debugging information supplied by Dataflow to help you debug your slow `DoFn`'s. It's shown if a DoFn is processing an element for more than 5 minutes (I think). If the DoFn is expected to be slow, you can ignore this message. If it's not expected - the processing lull message tells you exactly what the DoFn is currently doing so you can debug it. – jkff Jan 30 '18 at 22:49
  • @Andrew, are you using `ExecutionContextExecutorService` together with async requests or with synchronous? – Brodin Feb 03 '18 at 07:54
  • @jkff, thanks for helping me debug this. The first step was to add a timeout to the Scala future, which showed me that the "Processing lull" was actually promises which never terminated, thus forcing Dataflow to keep them around "forever". Now I get proper future timeout errors, but to no avail since the job is still not going forward. I've changed to synchronous calls now, but I am seeing a much lower throughput. – Brodin Feb 03 '18 at 07:58
  • @Brodin, I’m using `ExecutionContextExecutorService` for async requests. – Andrew Nguonly Feb 03 '18 at 17:02
  • @Brodin, I believe jkff's suggestion shed light on how to debug the "Processing lull" you're seeing. Are you making further progress on this, or are you still facing the same message afterwards? – JL-HaiNan Feb 07 '18 at 21:29
  • I removed the async handling and things are working better now, but I am seeing lower throughput. Thanks for your help guys! – Brodin Feb 22 '18 at 08:53
  • Having the same logs for a Dataflow step that upserts docs to an Elasticsearch server. The server seems to be stalling (no writes are allowed) and the job has also been stalling since that occurred on the server. It seems like it's possible for external reasons to halt the job at this step. – Malte Jul 02 '18 at 08:15
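
Regarding the thread-pool suggestion in the comments above, a minimal sketch of how an `ExecutionContextExecutorService` can be backed by a fixed-size pool to cap concurrent requests (the pool size of 8 is an arbitrary example, not a value from the discussion):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}

// Back the ExecutionContext with a fixed-size thread pool so that at
// most 8 service calls are in flight per worker JVM. Tune the size
// against the capacity of the external service.
val httpPool: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// Futures created with this implicit context are confined to the pool.
implicit val ec: ExecutionContext = httpPool
```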

1 Answer


@jkff's comment pointed me in the right direction. The first step was to add a timeout to the Scala future, which showed me that the "Processing lull" was actually promises which never terminated, thus forcing Dataflow to keep them around "forever". Now I get proper future timeout errors, but to no avail since the job is still not going forward. I have changed to synchronous calls for now, but I am seeing a much lower throughput. The timeout looked roughly like the sketch below.
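
A minimal sketch of the bounded wait, assuming a `Future[String]` from the service call (the 30-second value is an example, not the one used in the job):

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._

// Bound the wait so a promise that never completes surfaces as a
// TimeoutException instead of a permanent "Processing lull".
def awaitWithTimeout(response: Future[String]): String =
  try Await.result(response, 30.seconds)
  catch {
    case e: TimeoutException =>
      // Throwing makes Dataflow fail and retry the bundle.
      throw new RuntimeException("Service call timed out", e)
  }
```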
