Writing xarray datasets to AWS S3 takes a surprisingly long time, even when compute=False is set and no data is actually written.

Here's an example:

import fsspec
import xarray as xr

x = xr.tutorial.open_dataset("rasm")
target = fsspec.get_mapper("s3://bucket/target.zarr")
task = x.to_zarr(target, compute=False)

Even without computing anything, to_zarr takes around 6 seconds from an EC2 instance that's in the same region as the S3 bucket.
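The post does not say how the 6 seconds were measured. One way to time such a call is sketched below; `time_call` is a hypothetical helper (not part of xarray or fsspec), and the commented usage assumes the `x` and `target` objects from the example above.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the objects from the example above (needs S3 access):
# task, seconds = time_call(x.to_zarr, target, compute=False)
# print(f"to_zarr(compute=False) took {seconds:.2f} s")
```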

Looking at the debug logs, there seems to be quite a bit of redirecting going on, because the default region in aiobotocore is set to us-east-2 while the bucket is in eu-central-1.
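For anyone wanting to see the same redirects, debug output can be switched on with the standard logging module. This is a minimal sketch; the logger names are assumptions based on the package names and are not taken from the post.

```python
import logging

# Attach a handler to the root logger so that records are printed.
logging.basicConfig(level=logging.INFO)

# Assumed logger names, derived from the package names.
for name in ("s3fs", "botocore", "aiobotocore"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```

Run this before creating the mapper, so that the first requests to S3 are logged as well.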

If I first set the default region manually through an environment variable with

import os
os.environ['AWS_DEFAULT_REGION'] = 'eu-central-1'

then the required time drops to around 3.5 seconds.

So my questions are:

  1. Is there any way to pass the region to fsspec (or s3fs)? I've tried adding s3_additional_kwargs={"region": "eu-central-1"} to the get_mapper call, but that had no effect.

  2. Is there any better way to interface with zarr on S3 from xarray than the above (with fsspec)?


versions:

xarray: 0.17.0
zarr: 2.6.1
fsspec: 0.8.4
Val
  • You might take a look at https://github.com/pydata/xarray/issues/2300#issuecomment-805883595 "zarr and xarray chunking compatibility and to_zarr performance" – Josh Mar 25 '21 at 07:33
  • @Josh thanks, but I don't think this is a chunk issue – Val Mar 25 '21 at 08:02
  • @Val The [s3fs documentation](https://github.com/dask/s3fs/blob/7d80c3198923cc6016b67bd8e857642a17f4f7e5/s3fs/core.py#L619) shows `region_name` as a kwarg, and there is also an fsspec issue about setting the [region](https://github.com/intake/filesystem_spec/issues/386#issuecomment-683787104). That said, zarr is widely used for huge datasets. Unfortunately, I don't currently have the tools to verify the above links – Nagaraj Tantri Mar 26 '21 at 03:32
  • @NagarajTantri setting `client_kwargs={'region_name':'eu-central-1'}` does the trick and answers point #1. If you want, please post an answer I can accept. Thanks! – Val Mar 26 '21 at 12:23
  • Done, @Val. Do check it. – Nagaraj Tantri Mar 26 '21 at 12:48

1 Answer

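The [s3fs documentation](https://github.com/dask/s3fs/blob/7d80c3198923cc6016b67bd8e857642a17f4f7e5/s3fs/core.py#L619) shows region_name as a kwarg, and there is also an fsspec issue about setting the [region](https://github.com/intake/filesystem_spec/issues/386#issuecomment-683787104).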

So you can pass client_kwargs={'region_name': 'eu-central-1'} to get_mapper, like this:

fsspec.get_mapper("s3://bucket/target.zarr", 
                  client_kwargs={'region_name':'eu-central-1'})
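Combined with the code from the question, the whole thing might look like the sketch below. `s3_storage_options` and `write_lazily` are hypothetical helper names introduced here for illustration; the bucket URL and region are the ones from the question.

```python
def s3_storage_options(region_name):
    """Build the keyword arguments that pin the S3 client to one region."""
    return {"client_kwargs": {"region_name": region_name}}


def write_lazily(dataset, url, region_name):
    """Set up a zarr store on S3 without writing the array data yet."""
    import fsspec  # imported here so the helper above has no dependencies

    target = fsspec.get_mapper(url, **s3_storage_options(region_name))
    return dataset.to_zarr(target, compute=False)


# Usage (needs xarray, dask, s3fs, and credentials for the bucket):
# import xarray as xr
# x = xr.tutorial.open_dataset("rasm")
# task = write_lazily(x, "s3://bucket/target.zarr", "eu-central-1")
# task.compute()  # performs the actual write
```

This works where s3_additional_kwargs did not because client_kwargs is handed to the botocore client when it is created, while s3_additional_kwargs is passed along with individual S3 API calls.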

Also, zarr is widely used for huge datasets.

Nagaraj Tantri