
I have a problem with my pipeline. My goal is to read around 4k parquet files, load each one as a numpy array, and then make some aggregations, e.g. one file can produce about 100 keys, and each key has some amount of data. Then I have combine-per-key logic, and the goal is to reduce across all files so that each key ends up with a single value. On a smaller dataset it works fine, but when I run it on a bigger dataset I get two kinds of issues: one is OOM and the other is hot keys. I think the problem is that every key has the same number of entries, but the matrices for some keys are much bigger than for others.

I tried hot key fan-out; it is a bit better, but the problem still occurs. Do you know how to find the correct value for the fan-out?
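Here is a simplified sketch of what the pipeline does (the column name "key", the file path, and the summing reduction are placeholders, not my real logic):

```python
import apache_beam as beam
import numpy as np
import pyarrow.parquet as pq


def load_matrices(file_path):
    # Read one parquet file and emit (key, matrix) pairs.
    # Placeholder parsing: the real logic produces ~100 keys per file.
    table = pq.read_table(file_path).to_pandas()
    for key, rows in table.groupby("key"):
        yield key, rows.drop(columns=["key"]).to_numpy()


class ReduceMatrices(beam.CombineFn):
    # Placeholder reduction: collapses all matrices for a key into one value.
    def create_accumulator(self):
        return 0.0

    def add_input(self, acc, matrix):
        return acc + float(np.sum(matrix))

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, acc):
        return acc


file_list = ["gs://my-bucket/data/file-0001.parquet"]  # placeholder; ~4k paths in reality

with beam.Pipeline() as p:
    (
        p
        | "Files" >> beam.Create(file_list)
        | "Read" >> beam.FlatMap(load_matrices)   # -> (key, np.ndarray)
        | "Reduce" >> beam.CombinePerKey(ReduceMatrices()).with_hot_key_fanout(16)
        | "Print" >> beam.Map(print)
    )
```

The `with_hot_key_fanout(16)` value is just the number I experimented with; I don't know how to pick it properly.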

  • I am not able to open this link – Dawid Aug 20 '23 at 23:08
  • I use hot key logging. I'd say that each key has the same population, but some of them have a much bigger matrix. I played with hot key fan-out, but I can't see a huge improvement. – Dawid Aug 20 '23 at 23:44
  • I removed the previous comment. Here is the right link: https://cloud.google.com/dataflow/docs/guides/common-errors#hot-key-detected. For OOM, https://cloud.google.com/dataflow/docs/guides/troubleshoot-oom provides some good suggestions. You might tune number_of_worker_harness_threads (https://cloud.google.com/dataflow/docs/reference/pipeline-options#resource_utilization) – XQ Hu Aug 25 '23 at 21:22
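For reference, setting the option suggested in the last comment would look roughly like this in the Python SDK (the project, region, and bucket values are placeholders, and the thread count itself needs experimentation; fewer harness threads leaves more memory per thread, which can help with OOM):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; only the thread-count flag is the point here.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    number_of_worker_harness_threads=4,
)
```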

0 Answers