
Say I create a dataset with an integer variable.

import xarray as xr
import numpy as np

int_var = np.random.randint(0, 10, 10)
ds = xr.Dataset(data_vars={"int_var": (("x",), int_var)},
                coords={"x": range(10)})

Then I save it, providing an encoding and an integer fill value:

from numcodecs import Blosc

compressor = Blosc(cname='lz4')
encoding = {v: {'compressor': compressor, 'dtype': ds[v].dtype, "_FillValue": -9999}
            for v in ds.data_vars}
ds.to_zarr(store="example.zarr", mode='w', consolidated=True, encoding=encoding)

When I then read the data back, the dtype has changed from int32 to float64. However, the dtype is still recorded as <i8 in the .zmetadata file, and I see that the _FillValue is correctly loaded as an int.

# Loads int_var with dtype float64
reloaded = xr.open_zarr("example.zarr", consolidated=True)

I need it to be an integer type since I'm storing indices, and my job is to make the data easy to use; it's not acceptable for users to have to change the dtype of every integer column every time they need it.

I noticed that if I just delete _FillValue from the encoding dict, the type is maintained. What's going on and how do I fix it?

Adair
  • 1,697
  • 18
  • 22

1 Answer


By default, xarray replaces values equal to _FillValue with NaN. NaN only exists as a floating-point value, so whenever a _FillValue is present, xarray converts the whole array to float64 on load.

This happens with xarray.open_dataset as well, not just xarray.open_zarr.
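The dtype promotion is a NumPy rule, not anything specific to xarray; a minimal sketch with plain NumPy shows the same effect of masking integers with NaN:

```python
import numpy as np

# Masking an integer array with NaN forces a cast: NaN has no integer
# representation, so the result is promoted to float64.
arr = np.array([1, 2, -9999, 4], dtype=np.int32)
masked = np.where(arr == -9999, np.nan, arr)
print(masked.dtype)  # float64
```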

The solution is to pass mask_and_scale=False.

# Loads int_var with dtype int32 as desired
reloaded = xr.open_zarr("example.zarr", consolidated=True, mask_and_scale=False)
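If other (float) variables in the store genuinely need masking, another option is to keep mask_and_scale=True and restore the integer columns after loading; on an xarray DataArray that is .fillna(-9999).astype("int32"). Here is the idea as a plain-NumPy sketch, assuming the -9999 sentinel from the encoding above:

```python
import numpy as np

# What a masked load hands you: the sentinel has become NaN, dtype float64.
arr = np.array([1.0, np.nan, 3.0])

# Re-insert the fill value in place of NaN, then cast back to the
# original integer dtype.
restored = np.where(np.isnan(arr), -9999, arr).astype("int32")
print(restored.dtype)  # int32
```

The trade-off is that the sentinel value is back in the data, so downstream code has to know to treat -9999 as missing.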