So I'm trying to parallelize the process using a Dask cluster. Here's my attempt.
Getting the cluster ready:
from dask_gateway import Gateway

gateway = Gateway(
    address="http://traefik-pangeo-dask-gateway/services/dask-gateway",
    public_address="https://pangeo.aer-gitlab.com/services/dask-gateway",
    auth="jupyterhub",
)
options = gateway.cluster_options()
options
cluster = gateway.new_cluster(
    cluster_options=options,
)
# scale adaptively between 90 and 100 workers
cluster.adapt(minimum=90, maximum=100)
client = cluster.get_client()
cluster
client
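One quick sanity check on the setup (a minimal sketch, not part of my original code): wait until the workers have actually joined before submitting any work.

# minimal sketch, assuming `client` from the setup above
# block until at least 10 workers have connected, then print what the scheduler sees
client.wait_for_workers(n_workers=10)
print(len(client.scheduler_info()["workers"]), "workers connected")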
Then I have a function that loads files from S3, processes them, and uploads the result to a different S3 bucket. The function processes GOES data, selects a specific region from it, saves that region to a .nc file, and then pushes the file to S3:
import pandas as pd
import s3fs
import xarray as xr

def get_records(rec):
    # parse year/month/day/hour/minute out of the timestamp string in the record
    d = [rec[-1][0:4], rec[-1][4:6], rec[-1][6:8], rec[-1][9:11], rec[-1][11:13]]
    yr = d[0]
    mo = d[1]
    da = d[2]
    hr = d[3]
    mn = d[4]
    ps = s3fs.S3FileSystem(anon=True)
    period = pd.Period(str(yr) + '-' + str(mo) + '-' + str(da), freq='D')
    dy = period.dayofyear
    print(dy)
    cc = [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]  # look at the IR channels only for now
    dy = "{0:0=3d}".format(dy)
    # this loop is over the 10 IR channels
    for c in range(10):
        ch = "{0:0=2d}".format(cc[c])
        # list the files for this hour once, then open the last two time slices of the given record
        keys = ps.glob('s3://noaa-goes16/ABI-L1b-RadF/' + str(yr) + '/' + str(dy) + '/'
                       + "{0:0=2d}".format(int(hr)) + '/' + 'OR_ABI-L1b-RadF-M3C' + ch + '*')
        F1 = xr.open_dataset(ps.open(keys[-2]))[['Rad']]
        F2 = xr.open_dataset(ps.open(keys[-1]))[['Rad']]
        # select the radiance around the record's (x, y) location
        G1 = F1.where((F1.x >= (rec[0] - 0.005)) & (F1.x <= (rec[0] + 0.005))
                      & (F1.y >= (rec[1] - 0.005)) & (F1.y <= (rec[1] + 0.005)), drop=True)
        G2 = F2.where((F2.x >= (rec[0] - 0.005)) & (F2.x <= (rec[0] + 0.005))
                      & (F2.y >= (rec[1] - 0.005)) & (F2.y <= (rec[1] + 0.005)), drop=True)
        # concatenate the two time slices together
        G = xr.concat([G1, G2], dim='time')
        # concatenate the different channels
        if c == 0:
            T = G
        else:
            T = xr.concat([T, G], dim='channel')
    # save to a netCDF file and upload it to S3 (fs and bucket are defined elsewhere)
    path = rec[-1] + '.nc'
    T.to_netcdf(path)
    fs.put(path, bucket + path)
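Before handing this to the cluster, a quick local smoke test helps (just a sketch; records, fs, and bucket are assumed to already be defined as above):

# run one record eagerly, with no dask, so S3/permission errors show up
# directly instead of being hidden inside delayed tasks on the workers
get_records(records[0])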
Using Dask to run them in parallel. I have the workers process 50 files at a time, then clear their memory and run the next batch of 50 files:
import dask

for j in range(0, len(records), 50):
    files = []
    # min() keeps the last batch from reading past the end of the list
    for i in range(j, min(j + 50, len(records))):
        s3_ds = dask.delayed(get_records)(records[i])
        files.append(s3_ds)
    files = dask.compute(*files)
    client.restart()
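For reference, the same batching can also be written with futures instead of delayed (a sketch of an equivalent pattern using client.map from dask.distributed, not what I was actually running):

from dask.distributed import wait

# sketch of the same 50-file batches submitted as futures; client.map sends
# one get_records call per record, and wait() blocks until the batch finishes
for j in range(0, len(records), 50):
    batch = records[j:j + 50]  # slicing handles the short final batch
    futures = client.map(get_records, batch)
    wait(futures)
    client.restart()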
So now the problem is that the workers process files for a while, then stall. I have about 10 workers running, and one by one they just stop processing data and sit idle, even though they still have memory available. They won't do anything: they process 20-30 files and then go quiet. I tried giving them just 20 files at a time, and then they stop after 10-12 files. Below I have attached an image showing how some workers sit idle even though they have memory left. And the main thing is that a couple of weeks ago I was running the same code and it worked perfectly fine. I don't know what the problem is now.