
I have a problem with my pipeline. My goal is to read around 4k parquet files, load each one as a numpy array, and then make some aggregations, e.g. one file can produce about 100 keys, and each key has some amount of data. Then I have combine-per-key logic, and the goal is to reduce across all files so that each key ends up with a single value. On a smaller dataset it works fine, but when I run it on a bigger dataset I get two kinds of issues: one is OOM and the other is hot keys. I think the problem is that every key has the same number of entries, but the matrices for some keys are much bigger than for others.

I tried hot key fan-out; it is a bit better, but the problem still occurs. Do you know how to find the correct value for the fan-out?
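Here is a simplified sketch of what the pipeline does (the column name "key", the file path, and the summing reduction are placeholders, not my real logic):

```python
import apache_beam as beam
import numpy as np
import pyarrow.parquet as pq


def load_matrices(file_path):
    # Read one parquet file and emit (key, matrix) pairs.
    # Placeholder parsing: the real logic produces ~100 keys per file.
    table = pq.read_table(file_path).to_pandas()
    for key, rows in table.groupby("key"):
        yield key, rows.drop(columns=["key"]).to_numpy()


class ReduceMatrices(beam.CombineFn):
    # Placeholder reduction: collapses all matrices for a key into one value.
    def create_accumulator(self):
        return 0.0

    def add_input(self, acc, matrix):
        return acc + float(np.sum(matrix))

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, acc):
        return acc


file_list = ["gs://my-bucket/data/file-0001.parquet"]  # placeholder; ~4k paths in reality

with beam.Pipeline() as p:
    (
        p
        | "Files" >> beam.Create(file_list)
        | "Read" >> beam.FlatMap(load_matrices)   # -> (key, np.ndarray)
        | "Reduce" >> beam.CombinePerKey(ReduceMatrices()).with_hot_key_fanout(16)
        | "Print" >> beam.Map(print)
    )
```

The `with_hot_key_fanout(16)` value is just the number I experimented with; I don't know how to pick it properly.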

  • I am not able to open this link – Dawid Aug 20 '23 at 23:08
  • I use hot key logging. I'd say that each key has the same population, but some of them have a much bigger matrix. I played with hot key fan-out, but I can't see a huge improvement. – Dawid Aug 20 '23 at 23:44
  • I removed the previous comment. Here is the right link: https://cloud.google.com/dataflow/docs/guides/common-errors#hot-key-detected. For OOM, https://cloud.google.com/dataflow/docs/guides/troubleshoot-oom provides some good suggestions. You might tune number_of_worker_harness_threads (https://cloud.google.com/dataflow/docs/reference/pipeline-options#resource_utilization) – XQ Hu Aug 25 '23 at 21:22
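For reference, setting the option suggested in the last comment would look roughly like this in the Python SDK (the project, region, and bucket values are placeholders, and the thread count itself needs experimentation; fewer harness threads leaves more memory per thread, which can help with OOM):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; only the thread-count flag is the point here.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    number_of_worker_harness_threads=4,
)
```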

0 Answers