
Currently we have a Dataflow job which reads from Pub/Sub and writes Avro files to GCS using FileIO.writeDynamic. When we test with, say, 10000 events/sec, we are not able to process faster because WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards is very slow. Below is the snippet we are using to write. How can we improve it?

    PCollection<Event> windowedWrites = input.apply("Global Window", Window.<Event>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterFirst.of(AfterPane.elementCountAtLeast(50000),
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(DurationUtils
                    .parseDuration(windowDuration)))))
        .discardingFiredPanes());

    return windowedWrites
        .apply("WriteToAvroGCS", FileIO.<EventDestination, Five9Event>writeDynamic()
            .by(groupFn)
            .via(outputFn, Contextful.fn(new SinkFn()))
            .withTempDirectory(avroTempDirectory)
            .withDestinationCoder(destinationCoder)
            .withNumShards(1)
            .withNaming(namingFn));

We use custom file naming, say in the format gs://tenantID.<>/eventname/dddd-mm-dd/<uniq_id-shardIndex-of-numOfShards-pane-paneIndex.avro>
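For illustration, a FileIO.Write.FileNaming that produces names of that shape could look roughly like the sketch below (the class name and prefix handling are illustrative, not our actual namingFn):

    import java.util.UUID;
    import org.apache.beam.sdk.io.Compression;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.PaneInfo;

    // Illustrative sketch only, not the actual namingFn used above.
    // FileIO.Write.FileNaming receives the shard index, shard count and pane
    // info, which is where the shardIndex-of-numOfShards and paneIndex parts
    // of the file name come from.
    class EventFileNaming implements FileIO.Write.FileNaming {
      // e.g. "gs://tenantID.<>/eventname/dddd-mm-dd/" built per destination
      private final String prefix;

      EventFileNaming(String prefix) {
        this.prefix = prefix;
      }

      @Override
      public String getFilename(BoundedWindow window, PaneInfo pane,
          int numShards, int shardIndex, Compression compression) {
        return prefix + UUID.randomUUID() + "-" + shardIndex + "-of-" + numShards
            + "-pane-" + pane.getIndex() + ".avro";
      }
    }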

Brian McCutchon

2 Answers


As mentioned in the comments, the issue is likely withNumShards(1), which forces everything to happen on one worker.
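For example, keeping everything else from the snippet in the question and only raising the shard count would look roughly like this (10 shards is just an illustrative value; tune it to your throughput and number of workers):

    // Same write as in the question, but with more shards so that
    // GroupIntoShards can spread the work across workers instead of
    // funnelling everything through a single one.
    return windowedWrites
        .apply("WriteToAvroGCS", FileIO.<EventDestination, Five9Event>writeDynamic()
            .by(groupFn)
            .via(outputFn, Contextful.fn(new SinkFn()))
            .withTempDirectory(avroTempDirectory)
            .withDestinationCoder(destinationCoder)
            .withNumShards(10)   // was withNumShards(1)
            .withNaming(namingFn));

Keep in mind that with writeDynamic the shard count applies per destination and per fired pane, so raising it also means more, smaller output files.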

robertwb

As Robert said, when using withNumShards(1) Dataflow/Beam cannot parallelize the writing, so it all happens on the same worker. When the bundles are relatively large, this has a big impact on the performance of the pipeline. I made an example to demonstrate this:

I ran 3 pipelines that generate a lot of elements (~2 GB), each with 10 n1-standard-1 workers, but with 1 shard, 10 shards and 0 shards respectively (with 0 shards, Dataflow chooses the number of shards). This is how they behaved:

[Screenshot: total run times of the three jobs]

We see a big difference in total time between 0 or 10 shards and 1 shard. If we go into the job with 1 shard, we see that only one worker was doing anything (I disabled autoscaling):

[Screenshot: per-worker CPU utilization for the 1-shard job]

As Reza mentioned, this happens because all elements need to be shuffled onto the same worker so it can write the single shard.

Note that my example is Batch, which behaves differently from Streaming when it comes to threading, but the effect on pipeline performance is similar enough (in fact, in Streaming it may be even worse).

Here is some Python code so you can test this yourself:

    import argparse
    import random

    import apache_beam as beam
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    # --output is the output path prefix, --shards is the number of output
    # shards (0 lets the runner decide); remaining args become pipeline options.
    parser = argparse.ArgumentParser()
    parser.add_argument('--output', required=True)
    parser.add_argument('--shards', type=int, default=0)
    known_args, pipeline_args = parser.parse_known_args()
    pipeline_options = PipelineOptions(pipeline_args)

    p = beam.Pipeline(options=pipeline_options)

    def long_string_generator():
        string = "Apache Beam is an open source, unified model for defining " \
                 "both batch and streaming data-parallel processing " \
                 "pipelines. Using one of the open source Beam SDKs, " \
                 "you build a program that defines the pipeline. The pipeline " \
                 "is then executed by one of Beam’s supported distributed " \
                 "processing back-ends, which include Apache Flink, Apache " \
                 "Spark, and Google Cloud Dataflow. "

        # Pick 20 random words from the paragraph above.
        word_choice = random.sample(string.split(" "), 20)

        return " ".join(word_choice)

    def generate_elements(element, amount=1):
        # Emit `amount` (element, long string) pairs per input element.
        return [(element, long_string_generator()) for _ in range(amount)]

    (p | beam.Create(range(1500))
       | beam.FlatMap(generate_elements, amount=10000)
       | WriteToText(known_args.output, num_shards=known_args.shards))

    p.run()
Iñigo
  • So I did experiment with numShards: I set it to 5 and hit OOM https://issues.apache.org/jira/browse/BEAM-6923 since too many files were written – user2313227 Sep 22 '20 at 22:44
  • Using more files should not cause OOMs; it would actually help to avoid them. If you are getting OOMs, try using machines with more memory per vCPU and set the shards to 0 so Beam determines the best way to spread the work. Also, the JIRA you sent doesn't seem related and it's marked as fixed :D – Iñigo Sep 23 '20 at 07:30
  • With writeDynamic by(groupFn), since we are grouping based on tenantId and event timestamp, shouldn't that help with parallelism? – user2313227 Sep 25 '20 at 20:43