
I'm trying to count Kafka messages per key, using the direct runner.

If I set max_num_records=20 in ReadFromKafka, I can see results printed or output to text, like:

('2102', 5)
('2706', 5)
('2103', 5)
('2707', 5)

But without max_num_records, or if max_num_records is larger than the message count in the Kafka topic, the program keeps running but nothing is output. If I try to write with beam.io.WriteToText, only an empty temp folder is created, like: beam-temp-StatOut-d16768eadec511eb8bd897b012f36e97

Terminal shows:

2.30.0: Pulling from apache/beam_java8_sdk
Digest: sha256:720144b98d9cb2bcb21c2c0741d693b2ed54f85181dbd9963ba0aa9653072c19
Status: Image is up to date for apache/beam_java8_sdk:2.30.0
docker.io/apache/beam_java8_sdk:2.30.0

If I set 'enable.auto.commit': 'true' in the Kafka consumer config, the messages are committed and other clients from the same group can't read them, so I assume it's reading successfully, just not processing or outputting.

I tried fixed-time and sliding-time windowing, with and without different triggers; nothing changes.

I also tried the Flink runner and got the same result as the direct runner.

No idea what I did wrong, any help?

environment: CentOS 7, Anaconda, Python 3.8.8, Java 1.8.0_292, Beam 2.30

code as below:

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import (
    PipelineOptions, SetupOptions, StandardOptions)
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount

direct_options = PipelineOptions([
    "--runner=DirectRunner",
    "--environment_type=LOOPBACK",
    "--streaming",
])
direct_options.view_as(SetupOptions).save_main_session = True
direct_options.view_as(StandardOptions).streaming = True

conf = {'bootstrap.servers': '192.168.75.158:9092',
        'group.id': "g17",
        'enable.auto.commit': 'false',
        'auto.offset.reset': 'earliest'}

if __name__ == '__main__':
    with beam.Pipeline(options=direct_options) as p:
        msg_kv_bytes = ( p
            | 'ReadKafka' >> ReadFromKafka(consumer_config=conf, topics=['LaneIn']))
        messages = msg_kv_bytes | 'Decode' >> beam.MapTuple(lambda k, v: (k.decode('utf-8'), v.decode('utf-8')))
        counts = (
            messages
            | beam.WindowInto(
                window.FixedWindows(10),
                trigger=AfterCount(1),  # also tried AfterCount(4), AfterProcessingTime
                # allowed_lateness=3,
                accumulation_mode=AccumulationMode.ACCUMULATING)  # also tried DISCARDING
            # | 'Windowsing' >> beam.WindowInto(window.FixedWindows(10, 5))
            | 'TakeKeyPairWithOne' >> beam.MapTuple(lambda k, v: (k, 1))
            | 'Grouping' >> beam.GroupByKey()
            | 'Sum' >> beam.MapTuple(lambda k, v: (k, sum(v)))
        )
        output = (
            counts
            | 'Print' >> beam.ParDo(print)
            # |  'WriteText' >> beam.io.WriteToText('/home/StatOut',file_name_suffix='.txt')
        )

1 Answer


There are a couple of known issues that you might be running into. Beam's portable DirectRunner currently does not fully support streaming; the relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-7514. Beam's portable runners (including the DirectRunner) also have a known issue where streaming sources do not properly emit messages, so the max_num_records or max_read_time arguments have to be provided to convert such sources to bounded sources; the relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-11998.

chamikara
  • As I stated in the post, I also tried the Flink runner and got the same result as the direct runner. Does the Flink runner currently have the same issue as the direct runner? – CannonFodder Jul 30 '21 at 01:19
  • @CannonFodder you'll also need to set `experiments=["use_deprecated_read"]` along with at least one of `max_num_records` or `max_read_time` – Jon.H Sep 22 '21 at 19:59