I am transferring and rechunking data from netCDF to zarr. The process is slow and uses very little of the available CPU. I have tried several different configurations; sometimes one seems to do slightly better, but nothing has worked well. Does anyone have any tips for making this run more efficiently?
On the last attempt (and on some, perhaps all, of the previous attempts), using the distributed scheduler on a single machine with threads, the worker logs gave this message:
distributed.core - INFO - Event loop was unresponsive in Worker for 10.05s. This is often caused by long-running GIL-holding functions or moving large chunks of data.
Previously I ran into errors from memory getting used up, so I am writing the zarr in pieces, using the "stepwise_to_zarr" function below:
import numpy as np
import xarray as xr
import zarr


def stepwise_to_zarr(dataset, step_dim, step_size, chunks, out_loc, group):
    # Split the coordinate range of step_dim into slices of width step_size.
    start = float(dataset[step_dim].min())
    end = float(dataset[step_dim].max())
    iis = np.arange(start, end, step_size)
    if end > iis[-1]:
        iis = np.append(iis, end)
    coord = dataset.get_index(step_dim)
    first = True
    for i in range(1, len(iis)):
        lower, upper = iis[i - 1], iis[i]
        # Make the final slice inclusive of the upper bound so no labels are dropped.
        if upper >= end:
            labels = [c for c in coord if lower <= c <= upper]
        else:
            labels = [c for c in coord if lower <= c < upper]
        sub = dataset.sel({step_dim: labels})
        rechunked_sub = sub.chunk(chunks)
        write_sync = zarr.ThreadSynchronizer()
        if first:
            # The first slice creates the store (mode="w" overwrites any existing group).
            rechunked_sub.to_zarr(out_loc, group=group, consolidated=True,
                                  synchronizer=write_sync, mode="w")
            first = False
        else:
            # Subsequent slices are appended along step_dim.
            rechunked_sub.to_zarr(out_loc, group=group, consolidated=True,
                                  synchronizer=write_sync, append_dim=step_dim)
chunks = {'time': 8760, 'latitude': 21, 'longitude': 20}
ds = xr.open_mfdataset("path to data", parallel=True, combine="by_coords")
stepwise_to_zarr(ds, step_size=10, step_dim="longitude",
                 chunks=chunks, out_loc="path to output", group="group name")
In the CPU utilization plot above, the drop from ~6% utilization to ~0.5% utilization seems to coincide with the first "batch" of 10 degrees of longitude being finished.
Background info:
- I am using a single GCE instance of 32 vCPUs and 256 GB memory.
- The data is about 600 GB and is spread over about 150 netCDF files.
- The data is in GCS and I am using Cloud Storage FUSE to read and write data.
- I am rechunking the data from chunk sizes {'time': 1, 'latitude': 521, 'longitude': 1440} to chunk sizes {'time': 8760, 'latitude': 21, 'longitude': 20} (a rough chunk-size calculation is sketched right after this list).
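For reference, a back-of-the-envelope calculation of the chunk sizes involved. This assumes 4-byte float32 values; the actual dtype may differ:

import numpy as np

# Rough chunk-size arithmetic, assuming float32 data (an assumption; the real
# dtype may be different).
itemsize = np.dtype("float32").itemsize

source_chunk = 1 * 521 * 1440 * itemsize   # one source chunk: ~3.0 MB
target_chunk = 8760 * 21 * 20 * itemsize   # one target chunk: ~14.7 MB

print(f"source chunk ≈ {source_chunk / 1e6:.1f} MB")
print(f"target chunk ≈ {target_chunk / 1e6:.1f} MB")

# Each target chunk spans 8760 time steps, but each source chunk holds only
# one time step, so building a single target chunk touches ~8760 source chunks.
print("source chunks read per target chunk ≈", 8760)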
I have tried:
- Using the default multiprocessing scheduler
- Using the distributed scheduler on a single machine (https://docs.dask.org/en/latest/setup/single-distributed.html), both with processes=True and processes=False (a sketch of how I start the client is below this list).
- Both the distributed scheduler and the default multiprocessing scheduler while also setting environment variables to avoid oversubscribing threads, like so:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
as described in the best practices (https://docs.dask.org/en/latest/array-best-practices.html?highlight=export#avoid-oversubscribing-threads).
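For completeness, this is roughly how I start the single-machine distributed scheduler. A minimal sketch only; the exact n_workers, threads_per_worker, and memory_limit values are assumptions that I varied between runs:

from dask.distributed import Client

# Minimal sketch of the single-machine distributed setup.
# Worker/thread counts are assumptions; I varied them between attempts.
client = Client(
    processes=False,          # also tried processes=True
    n_workers=1,
    threads_per_worker=32,    # matches the 32 vCPUs on the instance
    memory_limit="200GB",
)
print(client)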