Later edit: TL;DR: How do I create an unbounded data source in Python? Is it possible?
I'm building a streaming dataflow that continuously processes float values coming from sensors. Each reading has a timestamp, an id, and a value; I put the values into FixedWindows
of 2 seconds, then output an aggregation.
Code link: https://gist.github.com/nicolaerosia/51981c600dacab4c021d99c0ce838b79
Here's the pipeline for quick reference:
files = [
    "in.csv",
]
fields = (p | beam.Create(files).with_output_types(str)
            | beam.ParDo(FileReader())
            | "ParseRawLine" >> beam.ParDo(ParseRawLine())
            | "AddEventTimestamp" >> beam.Map(
                lambda elem: beam.window.TimestampedValue(elem, elem['timestamp']))
            | "window" >> beam.WindowInto(
                windowfn=beam.transforms.window.FixedWindows(2),
                trigger=AfterWatermark(late=AfterProcessingTime(1)),
                accumulation_mode=AccumulationMode.DISCARDING,
            )
            | "MapID" >> beam.Map(lambda x: (x['id'], x['value']))
            | beam.GroupByKey()
            | "DummyWindowPrint" >> beam.ParDo(DummyWindowPrint())
)
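For context, here are simplified sketches of the helper DoFns referenced above (the real implementations are in the gist; the versions below are my own stripped-down approximations):

import logging

import apache_beam as beam
from apache_beam.utils.timestamp import Timestamp


class FileReader(beam.DoFn):
    # Emits a file's lines one by one (simplified; see the gist for the real one).
    def process(self, path):
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n')

    def finish_bundle(self):
        logging.debug("FileReader(%s): finish_bundle", id(self))


class ParseRawLine(beam.DoFn):
    # Parses a "timestamp,id,value" CSV line into a dict (simplified).
    def process(self, line):
        ts, sensor_id, value = line.split(',')
        entry = {'timestamp': Timestamp(float(ts)), 'id': int(sensor_id), 'value': value}
        logging.debug("ParseRawLine(%s): entry: %s", id(self), entry)
        yield entry


class DummyWindowPrint(beam.DoFn):
    # Logs each grouped element together with the window it landed in (simplified).
    def process(self, element, window=beam.DoFn.WindowParam):
        logging.debug("DummyWindowPrint: %s in %s", element, window)
        yield element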
The problem I have is that the operation after GroupByKey
only starts once the input finishes. Instead, I want the trigger on WindowInto
to fire 1 second after the last entry of a window arrives.
Example output:
DEBUG:root:ParseRawLine(140117071822672): entry: {'timestamp': Timestamp(1583964059.983996), 'id': 79, 'value': '0.6312056749059605'}
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DataInputOperation Create/Impulse receivers=[SingletonConsumerSet[Create/Impulse.out0, coder=WindowedValueCoder[BytesCoder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation Create/FlatMap(<lambda at core.py:2597>) output_tags=['out'], receivers=[SingletonConsumerSet[Create/FlatMap(<lambda at core.py:2597>).out0, coder=WindowedValueCoder[BytesCoder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation Create/Map(decode) output_tags=['out'], receivers=[SingletonConsumerSet[Create/Map(decode).out0, coder=WindowedValueCoder[StrUtf8Coder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation ParDo(FileReader) output_tags=['out'], receivers=[SingletonConsumerSet[ParDo(FileReader).out0, coder=WindowedValueCoder[StrUtf8Coder], len(consumers)=1]]>
DEBUG:root:FileReader(140117071413776): finish_bundle
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation ParseRawLine output_tags=['out'], receivers=[SingletonConsumerSet[ParseRawLine.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation AddEventTimestamp output_tags=['out'], receivers=[SingletonConsumerSet[AddEventTimestamp.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation window output_tags=['out'], receivers=[SingletonConsumerSet[window.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation MapID output_tags=['out'], receivers=[SingletonConsumerSet[MapID.out0, coder=WindowedValueCoder[TupleCoder[LengthPrefixCoder[FastPrimitivesCoder], LengthPrefixCoder[FastPrimitivesCoder]]], len(consumers)=1]]>
DEBUG:apache_beam.runners.worker.bundle_processor:finish <DataOutputOperation GroupByKey/Write >
DEBUG:apache_beam.runners.portability.fn_api_runner:Wait for the bundle bundle_1 to finish.
<!!!!!! HERE !!!!!!>
INFO:apache_beam.runners.portability.fn_api_runner:Running (GroupByKey/Read)+(ref_AppliedPTransform_DummyWindowPrint_16)
<!!!!!! HERE !!!!!!>
DEBUG:apache_beam.runners.worker.sdk_worker:Got work control_10
DEBUG:apache_beam.runners.worker.sdk_worker:Got work control_9
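For contrast, the same window/trigger setup fed from a TestStream (as far as I can tell, the closest stand-in for an unbounded source on the DirectRunner) does emit one result per window as the watermark advances. A minimal sketch, not my actual code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)
from apache_beam.transforms.window import FixedWindows, TimestampedValue

stream = (TestStream()
          .add_elements([TimestampedValue((79, 0.63), 0.5),
                         TimestampedValue((79, 0.71), 1.5)])
          .advance_watermark_to(3)   # moves past the [0, 2) window, letting it fire
          .add_elements([TimestampedValue((79, 0.42), 2.5)])
          .advance_watermark_to_infinity())

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p | stream
       | beam.WindowInto(FixedWindows(2),
                         trigger=AfterWatermark(late=AfterProcessingTime(1)),
                         accumulation_mode=AccumulationMode.DISCARDING)
       | beam.GroupByKey()
       | beam.Map(print))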
Beam versions tested:
master (2020 March 12)
release-2.19.0
release-2.20.0
Invocations:
python3 beam_issues.py \
--streaming \
--runner=DirectRunner
OR
python3 beam_issues.py \
--streaming \
--runner=DirectRunner \
--direct_num_workers=8 \
--direct_running_mode=multi_threading
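For reference, the script takes its options straight from the command line; I assume the gist's setup boils down to something like:

import logging

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()  # parses sys.argv, so --streaming, --runner, etc. are picked up
logging.info("streaming mode: %s", options.view_as(StandardOptions).streaming)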
What am I missing?