
This is the only error in the logs. All Dataflow workers shut down after 3.5 days of processing, by which point the job had gotten through more than half of the load. What does this error mean? I am not sure whether it is a memory issue that could be solved by increasing the resources. The exception should not come from user code, because everything is inside a blanket try...except block.

Workflow failed. Causes: S04:Reshuffle/ReshufflePerKey/GroupByKey/Read+Reshuffle/ReshufflePerKey/GroupByKey/GroupByWindow+Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)+Reshuffle/RemoveRandomKeys+ParDo(EnrichCompanies)+ParDo(LogCompanyPipelineRun) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: 

  company-batch-enrichment--02161750-u9wk-harness-3nj5
      Root cause: The worker lost contact with the service.,

  company-batch-enrichment--02161750-u9wk-harness-3nj5
      Root cause: The worker lost contact with the service.,

  company-batch-enrichment--02161750-u9wk-harness-3nj5
      Root cause: The worker lost contact with the service.,

  company-batch-enrichment--02161750-u9wk-harness-3nj5
      Root cause: The worker lost contact with the service.

Below are the resource metrics for the job (image not included).
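To make the "blanket try...except" mentioned above concrete, here is a minimal sketch of that pattern in plain Python. The names `enrich_safely` and `enrich` are hypothetical stand-ins for illustration and are not the actual `EnrichCompanies` code.

```python
import logging


def enrich_safely(element, enrich):
    """Apply `enrich` to one element, swallowing any Python-level exception.

    `enrich` is a hypothetical callable standing in for the real enrichment
    logic. Returns a one-item list on success and an empty list on failure,
    mirroring how a ParDo can emit zero or more outputs per input.
    """
    try:
        return [enrich(element)]
    except Exception:  # blanket handler: logs the error and drops the element
        logging.exception("enrichment failed for %r", element)
        return []


# A failing element is logged and dropped instead of failing the work item.
print(enrich_safely(2, lambda x: x * 2))
print(enrich_safely(0, lambda x: 1 / x))
```

A handler like this only sees exceptions raised inside the Python process. If the operating system kills the worker process for running out of memory, no `except` clause runs, which may be consistent with the "worker lost contact with the service" messages rather than an exception in the logs.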

Simant Luitel
  • You probably have a hot key that OOMs the worker. Check whether the worker's total CPU utilization was about 100% divided by the number of cores per machine (i.e., ~25% for an n1-standard-4) – Iñigo Feb 21 '22 at 23:27
  • @Iñigo I don't have a lot of experience with Dataflow. Shouldn't the Reshuffle() step take care of the hot key problem, if there is one? I am not sure I understand the question correctly, but the machine type is `n1-standard-1`, so there should be just 1 core per machine. The CPU utilisation of all the workers was 6-12%, and right before the failure one worker's CPU shot up to 99.73% and the job failed. Please let me know if you need any more information. Any help would be really appreciated! – Simant Luitel Feb 22 '22 at 02:10
  • Hi OP, have you already looked at this documentation (https://cloud.google.com/dataflow/docs/guides/common-errors#job-failed-four-times)? If not, it may help, and it also links to the error and exception handling documentation (https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#error-and-exception-handling). – Scott B Feb 22 '22 at 07:16
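The CPU check suggested in the first comment comes down to simple arithmetic: one fully busy core shows up as 100% divided by the number of cores in the machine's total CPU utilization. A minimal sketch follows. The function names and the 3-point tolerance are assumptions made for illustration and are not part of any Dataflow API.

```python
def single_core_saturation_pct(vcpus: int) -> float:
    """Total CPU % a machine reports when exactly one core is fully busy."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return 100.0 / vcpus


def looks_single_core_bound(total_cpu_pct: float, vcpus: int,
                            tolerance_pct: float = 3.0) -> bool:
    """True if the observed total CPU % is within `tolerance_pct` points of
    the single-core saturation level. The tolerance is an arbitrary
    illustrative choice, not a documented threshold."""
    return abs(total_cpu_pct - single_core_saturation_pct(vcpus)) <= tolerance_pct


# n1-standard-4 has 4 vCPUs; n1-standard-1 has 1 vCPU.
print(single_core_saturation_pct(4))
print(looks_single_core_bound(99.73, vcpus=1))
print(looks_single_core_bound(12.0, vcpus=1))
```

On an `n1-standard-1` the single-core level is 100%, so the 99.73% spike reported just before the failure matches it, while the earlier 6-12% readings do not.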
