
I am using Coiled to spin up a cluster and Dask to do some manipulation on a CSV read from an S3 bucket. However, at some point my workers are getting killed. When I inspected the logs, the following tasks are the ones killing them:

distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 65, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 70, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 71, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 86, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 1, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 8, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 45, 0) marked as failed because 3 workers died while trying to run it
distributed.scheduler - INFO - Task ('read-csv-values-values-00474dd1e867972e5b6636ffb4e71705', 39, 0) marked as failed because 3 workers died while trying to run it

So I then moved the CSV out of the S3 bucket into my local repository and ran it again, and the CSV read still failed.

Another point: the CSV read was working properly for earlier data manipulation, but once I added dummy encoding, date manipulation, and a .compute() call, the workers started getting killed.
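The pipeline is roughly of this shape (a sketch only, with placeholder column names and paths, not my actual code):

    import dask.dataframe as dd

    # Read the CSV from S3 (also tried a local path)
    df = dd.read_csv("s3://my-bucket/data.csv")

    # Date manipulation
    df["date"] = dd.to_datetime(df["date"])

    # Dummy encoding (object columns must be categorized first)
    df = dd.get_dummies(df.categorize())

    # Workers die around here
    result = df.compute()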

Any idea what might be going on?


1 Answer


There are at least two possibilities:

  1. the workers do not have sufficient resources to perform their task; a common reason is insufficient memory (see the sketch after this list);

  2. the task itself is problematic, for example (one out of many possible reasons) there is a datatype mismatch, so a function that expects an integer is unable to perform the computation with a NaN.
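For the first possibility, two common mitigations are requesting workers with more memory and reading the CSV in smaller partitions so each task holds less data at once. A minimal sketch, assuming Coiled's cluster API (the memory value, worker count, and bucket path are placeholders):

    import coiled
    import dask.dataframe as dd
    from distributed import Client

    # Ask Coiled for workers with more memory (placeholder values)
    cluster = coiled.Cluster(n_workers=4, worker_memory="16 GiB")
    client = Client(cluster)

    # Smaller partitions reduce the per-task memory footprint
    df = dd.read_csv("s3://my-bucket/data.csv", blocksize="64MB")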

To minimize the risk of failed tasks due to the second possibility, it's a good idea to test the code on a pandas DataFrame first.
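For example, reading a small sample with pandas and running the same transformations on it will surface dtype problems quickly (a sketch; column names are placeholders):

    import pandas as pd

    # Read only a sample of the file and apply the same transformations
    sample = pd.read_csv("data.csv", nrows=10_000)
    print(sample.dtypes)  # check for unexpected object/float columns

    # Fails loudly if a date cannot be parsed
    sample["date"] = pd.to_datetime(sample["date"], errors="raise")

    # Dummy encoding on the sample
    encoded = pd.get_dummies(sample, columns=["category"])

If the sample runs cleanly, the resulting dtypes can then be passed explicitly to dd.read_csv via its dtype= argument, so Dask does not have to infer them per partition.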

SultanOrazbayev