I am trying to use FinishBundle() to batch requests in Beam on Dataflow. The requests fetch information and emit it for further processing downstream in the pipeline, like so:

type BatchRpcFn struct {
  client RpcClient
  bufferRequest *RpcRequest
}

func (f *BatchRpcFn) Setup(ctx context.Context) {
  // setup client
}

func (f *BatchRpcFn) ProcessElement(ctx context.Context, id string, emit func(string, bool)) error {
  f.bufferRequest.Ids = append(f.bufferRequest.Ids, id)
  if len(f.bufferRequest.Ids) > bufferLimit {
    return f.performRequestAndEmit(ctx, emit)
  }
  return nil
}

func (f *BatchRpcFn) FinishBundle(ctx context.Context, emit func(string, bool)) error {
  return f.performRequestAndEmit(ctx, emit)
}

In unit tests, this function works as expected; however, when running on Dataflow, I get this error:

panic: interface conversion: typex.Window is window.GlobalWindow, not window.IntervalWindow
//...
github.com/apache/beam/sdks/v2/go/pkg/beam/core/runtime/exec.(*intervalWindowEncoder).EncodeSingle()

The documentation on FinishBundle() is a little sparse, so it wasn't clear to me if this is possible. Most of the examples I see of using FinishBundle() are flushing data to some sink instead of adding to the resultant PCollection.

Is this a bug, or am I using FinishBundle incorrectly here?

1 Answer

I think the processing should be done in ProcessElement() itself, which produces the resulting PCollection. StartBundle() and FinishBundle() are called once per bundle; their common use case is connecting to and disconnecting from an external service, database, etc.

I guess that a stateful DoFn may be a good way to batch the requests. For example, process a batch once five elements have been observed, and use an onTimer() callback to process the remaining elements at the end of the window.

However, as of the 2.42.0 release, only state support has been added to the Go SDK; timers are yet to be implemented.
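To make the "process after N elements, then handle the remainder" idea concrete, here is a minimal sketch in plain Go. It is not Beam API: the `batcher` type, its `add` and `drain` methods, and the `flush` callback are hypothetical names used only for illustration, and the buffer lives in ordinary memory rather than in Beam state. In a real stateful DoFn the buffer would be kept in state, and `drain` is the step that an end-of-window timer would trigger once timers are available.

```go
package main

import "fmt"

// batcher is a hypothetical, Beam-independent stand-in for the batching logic.
// flush represents the batched RPC plus emitting its results.
type batcher struct {
	limit int
	buf   []string
	flush func(ids []string)
}

// add buffers one id and flushes as soon as the buffer holds limit elements.
func (b *batcher) add(id string) {
	b.buf = append(b.buf, id)
	if len(b.buf) >= b.limit {
		b.drain()
	}
}

// drain flushes whatever is left; a no-op when the buffer is empty.
// With timers, this is what an end-of-window callback would call.
func (b *batcher) drain() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil // start a fresh slice so flush may keep the old one
}

func main() {
	b := &batcher{
		limit: 2,
		flush: func(ids []string) { fmt.Println("batch:", ids) },
	}
	for _, id := range []string{"a", "b", "c", "d", "e"} {
		b.add(id)
	}
	b.drain() // flushes the leftover element "e"
}
```

The sketch flushes when the buffer reaches the limit (`>=`), so no batch ever exceeds `limit` elements, and the empty check in `drain` avoids sending an empty request when the last element happened to complete a batch.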

  • Thanks @ritesh-ghorse! I see you're the one who is working on the state and timers for the go SDK, so thank you for that as well! As far as state and timers, that seems like a good option, but I guess I'll have to wait until you get the timer code merged to do it exactly as you're suggesting. Any other thoughts as to how I could do batched RPCs like the above? – Cam Phillips Oct 11 '22 at 15:38
  • Can you add the full stack trace to help debug, either here or by filing an issue at https://github.com/apache/beam/issues with more details? – Ritesh Ghorse Oct 11 '22 at 17:56
  • Actually, it looks like according to https://beam.apache.org/documentation/transforms/python/elementwise/pardo/, the output of `FinishBundle()` needs to be a windowed type, which would make this error make sense. – Cam Phillips Nov 01 '22 at 22:28