
I want to compress arrays stored inside a netCDF file by setting a scale factor and an add offset and converting the array data type (e.g. float32 to int16).

I want to turn raster data that would normally be too big to work with in Python into smaller, more manageable rasters. I know that scale factors and offsets can be applied to netCDF data to make the files smaller on disk, but would the same logic also make the arrays easier on memory once they are loaded? I already have another way of making large NumPy arrays manageable, but I would like to achieve this with netCDFs.

I already have the following code, which is based on several links, including http://james.hiebert.name/blog/work/2015/04/18/NetCDF-Scale-Factors.html

The test file I am using is one I generated myself: a netCDF file holding a float32 NumPy array, converted from a GeoTIFF with gdal_translate.
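
For reference, the GeoTIFF-to-netCDF conversion can be done with GDAL's command-line tool; the file names here are placeholders:

gdal_translate -of netCDF input.tif output.nc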

import netCDF4
from math import floor
import numpy as np


def compute_scale_and_offset(min, max, n):
    # stretch/compress data to the available packed range
    scale_factor = (max - min) / (2 ** n - 1)
    # translate the range to be symmetric about zero
    add_offset = min + 2 ** (n - 1) * scale_factor
    return scale_factor, add_offset

def pack_value(unpacked_value, scale_factor, add_offset):
    return unpacked_value - add_offset / scale_factor

def unpack_value(packed_value, scale_factor, add_offset):
    return packed_value * scale_factor + add_offset

netcdf_path = r"path/to/netcdf"

nc = netCDF4.Dataset(netcdf_path,"a")
data = nc.variables['Band1'][:]

scale_factor,offset = compute_scale_and_offset(np.min(data),np.max(data),16)
data = pack_value(data,scale_factor,offset)
data_b = data.astype(np.int16,copy=False)
nc.variables['Band1'][:] = data_b

nc.close()

Right now the file I am working with does not change in size when I run the above code, but the values in the core data array do change. My expected result is an alteration of the above code that works on any generic netCDF file: convert the data array, and store the scale factor and add offset in the file so that they are applied automatically when the data are read back with netCDF4.

Krios101

  • Not an answer to your question, but I think that should be `return (unpacked_value - add_offset) / scale_factor`... – Bart Jul 24 '19 at 16:53
  • Also, `nc.variables['Band1'][:] = data_b` is just going to cast `data_b` to whatever datatype `Band1` is. I don't think NetCDF4 allows you to change the NetCDF datatype in place, and deleting variables (and recreating them with another datatype) is also difficult/impossible, so you might have to create an entirely new NetCDF file with the correct data types, which sounds like a lot of trouble for such a simple task `:-(` – Bart Jul 24 '19 at 17:27
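
A minimal sketch of the approach Bart's second comment suggests (writing the packed data to a new file with the plain netCDF4 API) could look like the following. It is untested and makes assumptions: the variable is called 'Band1' as above, only dimensions are copied (no coordinate variables or global attributes), and auto-scaling is left at its default so netCDF4 packs the floats on write:

import netCDF4
import numpy as np

src = netCDF4.Dataset("input.nc", "r")
dst = netCDF4.Dataset("packed.nc", "w")

data = src.variables['Band1'][:]
vmin, vmax = float(np.min(data)), float(np.max(data))
scale_factor = (vmax - vmin) / (2 ** 16 - 1)
add_offset = vmin + 2 ** 15 * scale_factor

# copy the dimensions of the source file
for name, dim in src.dimensions.items():
    dst.createDimension(name, len(dim))

# recreate the variable as int16 with the packing attributes attached
var = dst.createVariable('Band1', np.int16,
                         src.variables['Band1'].dimensions,
                         fill_value=-32767)
var.scale_factor = scale_factor
var.add_offset = add_offset

# with auto-scaling on (the netCDF4 default), assigning the float data
# packs it to int16 on write; it is unpacked again automatically on read
var[:] = data

src.close()
dst.close()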

1 Answer


Xarray can handle this via the `encoding` argument of `to_netcdf`:

import numpy as np
import xarray as xr


def compute_scale_and_offset(da, n=16):
    """Calculate offset and scale factor for int conversion

    Based on Krios101's code above.
    """

    vmin = np.min(da).item()
    vmax = np.max(da).item()

    # stretch/compress data to the available packed range
    scale_factor = (vmax - vmin) / (2 ** n - 1)

    # translate the range to be symmetric about zero
    add_offset = vmin + 2 ** (n - 1) * scale_factor

    return scale_factor, add_offset


ds = xr.open_dataset("infile.nc")

scale_factor, add_offset = compute_scale_and_offset(ds['my_var'])

ds.to_netcdf("outfile.nc", encoding={"my_var": {
    "dtype": 'int16',
    "scale_factor": scale_factor,
    "add_offset": add_offset,
    "_FillValue": -32767,
}})
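
As a quick check: when the packed file is opened again, xarray applies the scale factor and offset on read by default, so the variable comes back as floats without any extra work (the file name matches the call above):

ds_packed = xr.open_dataset("outfile.nc")
print(ds_packed['my_var'].dtype)                     # floats again after unpacking
print(ds_packed['my_var'].encoding['scale_factor'])  # packing parameters kept in .encoding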
naught101
  • `compute_scale_and_offset(da['my_var'])` should be `compute_scale_and_offset(ds['my_var'])` – Florian Feb 23 '21 at 17:51
  • Thanks @Florian, fixed. – naught101 Feb 23 '21 at 23:09
  • Because you need to read the entire data array to figure out the min/max, if your dataset is big, it's useful to use Dask. Dask doesn't like `.item()` though. Dask does work if you change `vmin = np.min(da).item()` and `vmax = np.max(da).item()` to `vmin = float(da.min().values)` and `vmax = float(da.max().values)` – Rich Signell Jul 23 '21 at 19:02
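
Putting Rich Signell's suggestion together with the answer, a Dask-friendly variant might look like the sketch below. The chunk sizes and the dimension names ("lat", "lon") are placeholders and would need to match the actual file:

import xarray as xr

# open lazily with Dask so the full array never has to sit in memory at once
ds = xr.open_dataset("infile.nc", chunks={"lat": 1000, "lon": 1000})

da = ds["my_var"]
n = 16

# min/max are computed chunk by chunk; .values forces the scalar result
vmin = float(da.min().values)
vmax = float(da.max().values)

scale_factor = (vmax - vmin) / (2 ** n - 1)
add_offset = vmin + 2 ** (n - 1) * scale_factor

# the write also streams chunk by chunk
ds.to_netcdf("outfile.nc", encoding={"my_var": {
    "dtype": "int16",
    "scale_factor": scale_factor,
    "add_offset": add_offset,
    "_FillValue": -32767,
}})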