
Later edit: TL;DR: How do I create an unbounded data source in Python? Is it possible?

I'm building a streaming dataflow which continuously processes float values coming from sensors. Each reading has a timestamp, an id, and a value. The pipeline puts the values into FixedWindows of 2 seconds, then outputs an aggregation.

Code link: https://gist.github.com/nicolaerosia/51981c600dacab4c021d99c0ce838b79

Here's the pipeline at a glance:

    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, AfterWatermark)

    # FileReader, ParseRawLine and DummyWindowPrint are custom DoFns
    # defined in the gist linked above; p is the Pipeline.
    files = [
        "in.csv",
    ]

    fields = (p | beam.Create(files).with_output_types(str)
                | beam.ParDo(FileReader())                      # one CSV line per element
                | "ParseRawLine" >> beam.ParDo(ParseRawLine())  # line -> {'timestamp', 'id', 'value'}
                | "AddEventTimestamp" >> beam.Map(
                    lambda elem: beam.window.TimestampedValue(elem, elem['timestamp']))
                | "window" >> beam.WindowInto(
                    windowfn=beam.transforms.window.FixedWindows(2),
                    trigger=AfterWatermark(late=AfterProcessingTime(1)),
                    accumulation_mode=AccumulationMode.DISCARDING,
                )
                | "MapID" >> beam.Map(lambda x: (x['id'], x['value']))
                | beam.GroupByKey()
                | "DummyWindowPrint" >> beam.ParDo(DummyWindowPrint())
    )

The problem I have is that the operation after GroupByKey only starts after the input finishes. However, I want the trigger on WindowInto to fire 1 second after the last entry in a window arrives.
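For reference, here is a minimal sketch of an unbounded-style source built with Beam's TestStream testing utility, which emits timestamped elements and advances the watermark explicitly so event-time windows can close before the whole input is exhausted (the sensor values are made up; this is not my real source):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.testing.test_stream import TestStream
    from apache_beam.transforms import window

    # TestStream emits timestamped elements and moves the watermark
    # explicitly, so downstream GroupByKey can emit per window instead
    # of waiting for the input to finish.
    stream = (TestStream()
        .add_elements([
            window.TimestampedValue({'id': 79, 'value': 0.63}, 0.5),
            window.TimestampedValue({'id': 79, 'value': 0.71}, 1.5),
        ])
        .advance_watermark_to(2)  # lets the first [0, 2) window fire
        .add_elements([
            window.TimestampedValue({'id': 79, 'value': 0.42}, 2.5),
        ])
        .advance_watermark_to_infinity())

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        _ = (p | stream
               | beam.WindowInto(window.FixedWindows(2))
               | beam.Map(lambda x: (x['id'], x['value']))
               | beam.GroupByKey()
               | beam.Map(print))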

Example output:

    DEBUG:root:ParseRawLine(140117071822672): entry: {'timestamp': Timestamp(1583964059.983996), 'id': 79, 'value': '0.6312056749059605'}
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DataInputOperation Create/Impulse receivers=[SingletonConsumerSet[Create/Impulse.out0, coder=WindowedValueCoder[BytesCoder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation Create/FlatMap(<lambda at core.py:2597>) output_tags=['out'], receivers=[SingletonConsumerSet[Create/FlatMap(<lambda at core.py:2597>).out0, coder=WindowedValueCoder[BytesCoder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation Create/Map(decode) output_tags=['out'], receivers=[SingletonConsumerSet[Create/Map(decode).out0, coder=WindowedValueCoder[StrUtf8Coder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation ParDo(FileReader) output_tags=['out'], receivers=[SingletonConsumerSet[ParDo(FileReader).out0, coder=WindowedValueCoder[StrUtf8Coder], len(consumers)=1]]>
    DEBUG:root:FileReader(140117071413776): finish_bundle
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation ParseRawLine output_tags=['out'], receivers=[SingletonConsumerSet[ParseRawLine.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation AddEventTimestamp output_tags=['out'], receivers=[SingletonConsumerSet[AddEventTimestamp.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation window output_tags=['out'], receivers=[SingletonConsumerSet[window.out0, coder=WindowedValueCoder[FastPrimitivesCoder], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DoOperation MapID output_tags=['out'], receivers=[SingletonConsumerSet[MapID.out0, coder=WindowedValueCoder[TupleCoder[LengthPrefixCoder[FastPrimitivesCoder], LengthPrefixCoder[FastPrimitivesCoder]]], len(consumers)=1]]>
    DEBUG:apache_beam.runners.worker.bundle_processor:finish <DataOutputOperation GroupByKey/Write >
    DEBUG:apache_beam.runners.portability.fn_api_runner:Wait for the bundle bundle_1 to finish.

    <!!!!!! HERE !!!!!!>
    INFO:apache_beam.runners.portability.fn_api_runner:Running (GroupByKey/Read)+(ref_AppliedPTransform_DummyWindowPrint_16)
    <!!!!!! HERE !!!!!!>

    DEBUG:apache_beam.runners.worker.sdk_worker:Got work control_10
    DEBUG:apache_beam.runners.worker.sdk_worker:Got work control_9

Beam versions tested:

  • master (2020 March 12)
  • release-2.19.0
  • release-2.20.0

Invocations:

    python3 beam_issues.py \
        --streaming \
        --runner=DirectRunner

OR

    python3 beam_issues.py \
        --streaming \
        --runner=DirectRunner \
        --direct_num_workers=8 \
        --direct_running_mode=multi_threading

What am I missing?

  • Hi, did you find a solution? I really struggle with streaming and grouping/combining. It seems like the window never closes. My code is very similar to yours. – Edmundo Del Gusto Nov 26 '20 at 09:23
  • Hello, thank you for asking – unfortunately no, I have stopped using Beam. I also reported it on JIRA and in chat but never received any answer at all. You are on your own. I would personally suggest looking at something more popular, even if it is not as attractive as Beam. – Nicolae Rosia Nov 27 '20 at 10:05
  • Yes, I'm so annoyed by the exceptionally bad documentation. Can you recommend a Python pipeline library that supports windowing? – Edmundo Del Gusto Nov 27 '20 at 15:35

1 Answer


Setting a trigger like the following will solve the problem:

    from apache_beam.transforms.trigger import (
        AfterAny, AfterCount, AfterProcessingTime, Repeatedly)

    trigger=Repeatedly(
        AfterAny(
            AfterCount(100),
            AfterProcessingTime(1),
        )
    )
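Wired into the question's "window" step, it would look roughly like this (a sketch; the AfterCount(100) threshold and the 1-second processing-time delay are example values to tune):

    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterAny, AfterCount, AfterProcessingTime, Repeatedly)

    # Drop-in replacement for the "window" step of the question's pipeline:
    # each window (re-)fires as soon as either 100 elements have arrived or
    # 1 second of processing time has passed, discarding panes already emitted.
    window_step = "window" >> beam.WindowInto(
        beam.transforms.window.FixedWindows(2),
        trigger=Repeatedly(AfterAny(AfterCount(100), AfterProcessingTime(1))),
        accumulation_mode=AccumulationMode.DISCARDING,
    )

Repeatedly keeps the trigger active for the lifetime of the window instead of firing only once, and AfterAny fires as soon as either of its sub-triggers is satisfied.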
– sees