My question is very similar to another post: Apache Beam Cloud Dataflow Streaming Stuck Side Input.
However, I tried the resolution suggested there (applying GlobalWindows() to the side input), and it did not seem to fix my problem.
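For reference, this is roughly what I tried, based on that answer. This is a sketch: the trigger settings are my reading of what that answer intended, and raw_side stands in for my actual side input source.

import apache_beam
from apache_beam import window
from apache_beam.transforms import trigger

side_input = (
    raw_side  # hypothetical upstream PCollection
    | apache_beam.WindowInto(
        window.GlobalWindows(),
        # Re-fire the side input on every new element rather than waiting
        # for the (never-closing) global window to end.
        trigger=trigger.Repeatedly(trigger.AfterCount(1)),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
)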
I have a Dataflow pipeline (though I'm using DirectRunner for debugging) built with the Python SDK, where the main input is logs from PubSub and the side input is associated data from a mostly unchanging database. I would like to join the two such that each log is paired with side input data from approximately the same time. Excess side input elements without an associated log can be dropped.
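To make the intended pairing concrete, this is the kind of windowed join I am aiming for. This is a hypothetical sketch: the 5-second fixed windows are an arbitrary choice, and raw_main/raw_side stand in for the real sources.

import apache_beam
from apache_beam import window

windowed_side = raw_side | 'WindowSide' >> apache_beam.WindowInto(
    window.FixedWindows(5))
windowed_main = raw_main | 'WindowMain' >> apache_beam.WindowInto(
    window.FixedWindows(5))
# Each main element should only see side elements from its own window,
# i.e. from (approximately) the same time.
joined = windowed_main | apache_beam.FlatMap(
    lambda left, rights: [(left, r) for r in rights],
    rights=apache_beam.pvalue.AsIter(windowed_side))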
The behavior I see is that the pipeline appears to operate as a single thread: it processes all of the side input elements first, then starts processing the main input elements. If the side input is bounded (non-streaming), this is fine, and the pipeline can merge the inputs and run to completion. If the side input is unbounded (streaming), however, the main input is blocked indefinitely, apparently waiting for the side input processing to finish.
To illustrate the behavior, I made the simplified test case below.
import logging

import apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils import timestamp


class Logger(apache_beam.DoFn):
    """Logs every element it sees, tagged with a name for readability."""

    def __init__(self, name):
        self._name = name

    def process(self, element, w=apache_beam.DoFn.WindowParam,
                ts=apache_beam.DoFn.TimestampParam):
        logging.error('%s: %s', self._name, element)
        yield element


def cross_join(left, rights):
    for right in rights:
        yield (left, right)


def main():
    # DirectRunner for debugging; streaming=True so unbounded sources work.
    pipeline = apache_beam.Pipeline(options=PipelineOptions(streaming=True))

    start = timestamp.Timestamp.now()
    # Bounded side inputs work OK.
    stop = start + 20
    # Unbounded side inputs appear to block execution of main input
    # processing.
    # stop = timestamp.MAX_TIMESTAMP
    side_interval = 5
    main_interval = 1

    # Unique labels on each step, since both branches reuse the same
    # transform types.
    side_input = (
        pipeline
        | 'SideImpulse' >> PeriodicImpulse(
            start_timestamp=start,
            stop_timestamp=stop,
            fire_interval=side_interval,
            apply_windowing=True)
        | 'SideTag' >> apache_beam.Map(lambda x: ('side', x))
        | 'SideLogger' >> apache_beam.ParDo(Logger('side_input'))
    )

    main_input = (
        pipeline
        | 'MainImpulse' >> PeriodicImpulse(
            start_timestamp=start, stop_timestamp=stop,
            fire_interval=main_interval, apply_windowing=True)
        | 'MainTag' >> apache_beam.Map(lambda x: ('main', x))
        | 'MainLogger' >> apache_beam.ParDo(Logger('main_input'))
        # Pair every main element with the currently available side elements.
        | 'CrossJoin' >> apache_beam.FlatMap(
            cross_join, rights=apache_beam.pvalue.AsIter(side_input))
        | 'CrossJoinLogger' >> apache_beam.ParDo(Logger('cross_join_output'))
    )

    pipeline.run()


if __name__ == '__main__':
    main()
Am I missing something that is preventing the main inputs from being processed in parallel with the side inputs?