
I am trying to perform some calculations on xarray data. The data has lat, lon and time coordinates, and multiple data variables.

My calculation is performed on a single timestep. In an attempt to parallelize this, I am using the dask distributed client.

I am running the client on a single machine like this:

import dask
from dask.distributed import Client
client = Client(processes=False)
dask.config.set({'distributed.comm.timeouts.connect': '20s'})

I increased the connect timeout in the config because I had been getting timeout errors (based on this comment: Long running workers blocking GIL timeout errors).

Some of my tasks (invocations of the same function over single timesteps) are executing fine, while for others I get warnings like these:

distributed.worker - WARNING -  Compute Failed
Function:  slom_per_timeslice
args:      ((numpy.datetime64('1987-02-06T10:00:00.000000000'), <xarray.Dataset>
Dimensions:     (lat: 10, lon: 13)
Coordinates:
  * lat         (lat) float64 36.88 40.88 44.88 48.88 ... 64.88 68.88 72.88
  * lon         (lon) float64 -12.88 -8.875 -4.875 -0.875 ... 27.12 31.12 35.12
    time        datetime64[ns] 1987-02-06T10:00:00
Data variables:
    t2m         (lat, lon) float32 dask.array<chunksize=(10, 13), meta=np.ndarray>
    solarCF     (lat, lon) float32 dask.array<chunksize=(10, 13), meta=np.ndarray>
    windCF_off  (lat, lon) float32 dask.array<chunksize=(10, 13), meta=np.ndarray>
    windCF_on   (lat, lon) float32 dask.array<chunksize=(10, 13), meta=np.ndarray>))
kwargs:    {}
Exception: OSError("Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/1' after 10 s: Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/1' after 10 s: connect() didn't finish in time")

The inproc address in the error is the address of the scheduler:

Scheduler: inproc://192.168.xxx.xx/5050/1

Eventually it crashes the script with the following error:

OSError: Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/1' after 10 s: connect() didn't finish in time
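
The args in the warning show that every data variable in the failing timeslice is still a lazy dask.array. One thing that might be worth trying (an untested assumption, not a confirmed cause of the timeouts) is to load each timeslice into memory before handing it to the workers, so that a task does not have to go back to the scheduler to compute its own inputs. A minimal sketch, where `make_example_dataset` and `loaded_timeslices` are hypothetical helper names and the data is random filler shaped like the dataset in the log:

```python
import numpy as np
import xarray as xr

VARIABLES = ("t2m", "solarCF", "windCF_off", "windCF_on")


def make_example_dataset(n_times=4):
    """Build a small dask-backed dataset (hypothetical filler data)."""
    start = np.datetime64("1987-02-06T10:00:00", "ns")
    times = start + np.arange(n_times) * np.timedelta64(1, "h")
    lat = np.linspace(36.88, 72.88, 10)
    lon = np.linspace(-12.88, 35.12, 13)
    shape = (n_times, lat.size, lon.size)
    rng = np.random.default_rng(0)
    ds = xr.Dataset(
        {
            name: (("time", "lat", "lon"), rng.random(shape, dtype=np.float32))
            for name in VARIABLES
        },
        coords={"time": times, "lat": lat, "lon": lon},
    )
    # One chunk per timestep, so each variable is backed by a dask array.
    return ds.chunk({"time": 1})


def loaded_timeslices(ds):
    """Return (timestamp, Dataset) pairs with every slice loaded into memory."""
    return [(t, ds.sel(time=t).load()) for t in ds["time"].values]
```

The resulting pairs have the same (timestamp, Dataset) shape as the args in the warning, so they could be passed to something like `client.map(slom_per_timeslice, loaded_timeslices(ds))`. This trades memory on the client for simpler tasks, so it only makes sense if all timeslices fit in memory at once.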

Note that these time out after 10 s, rather than the 20 s I set in the config. The timeout does appear to have changed, however, since this is the error I get when clicking on the worker logs in the dask dashboard (where I get a 500: Internal Server Error):

distributed.utils - ERROR - Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/3' after 20 s: Timed out trying to connect to 'inproc://192.168.xxx.xx/5050/3' after 20 s: connect() didn't finish in time
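
Seeing both 10 s and 20 s could mean that some components picked up the default timeout before the option was changed. That is an assumption, not something verified here. A minimal sketch of setting the option before the client is created and reading it back, where `set_connect_timeout` and `start_local_client` are hypothetical helper names:

```python
import dask

CONNECT_OPTION = "distributed.comm.timeouts.connect"


def set_connect_timeout(value="20s"):
    """Set the connect timeout and return the value dask now reports."""
    dask.config.set({CONNECT_OPTION: value})
    return dask.config.get(CONNECT_OPTION)


def start_local_client(timeout="20s"):
    """Configure the timeout first, then start the in-process client."""
    set_connect_timeout(timeout)
    # Imported here so the option is in place before distributed is set up.
    from dask.distributed import Client

    return Client(processes=False)
```

Calling `start_local_client()` in place of the two lines from the question keeps the same settings and only changes the order in which they are applied.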

I have run the code on both the research group's PC (where the above errors occur) and my personal desktop. The same code runs fine on my personal desktop. The code runs in the same environment on both machines, the only difference being that my desktop runs Windows and the research group's PC runs Ubuntu. I ran client.get_versions(check=True), which outputs a dict of the packages in use:

{'scheduler': {'host': {'python': '3.8.5.final.0',
   'python-bits': 64,
   'OS': 'Linux',
   'OS-release': '4.15.0-121-generic',
   'machine': 'x86_64',
   'processor': 'x86_64',
   'byteorder': 'little',
   'LC_ALL': 'None',
   'LANG': 'en_GB.UTF-8'},
  'packages': {'python': '3.8.5.final.0',
   'dask': '2.26.0',
   'distributed': '2.26.0',
   'msgpack': '1.0.0',
   'cloudpickle': '1.6.0',
   'tornado': '6.0.4',
   'toolz': '0.10.0',
   'numpy': '1.19.1',
   'lz4': None,
   'blosc': None}},
 'workers': {'inproc://192.168.178.44/5050/3': {'host': {'python': '3.8.5.final.0',
    'python-bits': 64,
    'OS': 'Linux',
    'OS-release': '4.15.0-121-generic',
    'machine': 'x86_64',
    'processor': 'x86_64',
    'byteorder': 'little',
    'LC_ALL': 'None',
    'LANG': 'en_GB.UTF-8'},
   'packages': {'python': '3.8.5.final.0',
    'dask': '2.26.0',
    'distributed': '2.26.0',
    'msgpack': '1.0.0',
    'cloudpickle': '1.6.0',
    'tornado': '6.0.4',
    'toolz': '0.10.0',
    'numpy': '1.19.1',
    'lz4': None,
    'blosc': None}}},
 'client': {'host': {'python': '3.8.5.final.0',
   'python-bits': 64,
   'OS': 'Linux',
   'OS-release': '4.15.0-121-generic',
   'machine': 'x86_64',
   'processor': 'x86_64',
   'byteorder': 'little',
   'LC_ALL': 'None',
   'LANG': 'en_GB.UTF-8'},
  'packages': {'python': '3.8.5.final.0',
   'dask': '2.26.0',
   'distributed': '2.26.0',
   'msgpack': '1.0.0',
   'cloudpickle': '1.6.0',
   'tornado': '6.0.4',
   'toolz': '0.10.0',
   'numpy': '1.19.1',
   'lz4': None,
   'blosc': None}}}

(These match between my desktop and the research group's computer, apart from the operating system.)
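
Comparing a dict like this by eye is error-prone, so here is a small sketch that lists any package whose version differs between the scheduler, the client and the workers. It assumes only the dictionary layout shown above, and `package_mismatches` is a hypothetical helper name, not part of dask:

```python
def package_mismatches(versions):
    """Return {package: {role: version}} for packages whose versions differ."""
    roles = {
        "scheduler": versions["scheduler"],
        "client": versions["client"],
    }
    # Each worker is keyed by its address, e.g. 'inproc://.../5050/3'.
    roles.update(versions.get("workers", {}))

    seen = {}
    for role, info in roles.items():
        for package, version in info["packages"].items():
            seen.setdefault(package, {})[role] = version

    # Keep only packages reported with more than one distinct version.
    return {
        package: by_role
        for package, by_role in seen.items()
        if len(set(by_role.values())) > 1
    }
```

For the output above, `package_mismatches(client.get_versions())` would be expected to return an empty dict, since the scheduler, worker and client entries list identical packages.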

What causes some of my tasks to fail?

phrasper
  • Were you able to fix this issue? – Singh Apr 12 '22 at 14:55
  • No, unfortunately I did not manage to solve this. We gave up on using dask and reserved more calculation time on the research group's calculation cluster. – phrasper Apr 14 '22 at 09:56
  • Thanks for replying. I'm currently facing the same issue, and there aren't many solutions for it. – Singh Apr 14 '22 at 14:30

0 Answers