I'm analyzing ocean temperature data from a climate model simulation. The 4D data arrays (time, depth, latitude, longitude; denoted `dask_array` below) typically have a shape of (6000, 31, 189, 192) and a size of ~25 GB, hence my desire to use dask (I've been getting memory errors trying to process these arrays with plain numpy).

I need to fit a cubic polynomial along the time axis at each depth/latitude/longitude point and store the resulting four coefficients. I've therefore set `chunksize=(6000, 1, 1, 1)` so that each grid point has its own chunk.
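For anyone who wants to reproduce this without the model output files, a random array with the same shape and chunking can stand in for the real data (the values themselves don't matter here):

```python
import dask.array as da

# Stand-in for the real model output: same shape and chunking as my data
dask_array = da.random.random((6000, 31, 189, 192), chunks=(6000, 1, 1, 1))
```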
This is my function for getting the coefficients of the cubic polynomial (`time_axis` is a global 1D numpy array, defined elsewhere, with the same length as the time dimension):
```python
import numpy

def my_polyfit(data):
    # Squeeze the single-point chunk down to 1D, then fit a cubic against time
    return numpy.polyfit(time_axis, data.squeeze(), 3)
```
(So in this case, `numpy.polyfit` returns an array of 4 coefficients.)
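A quick sanity check on synthetic data (with the shape a single chunk would have) confirms the per-point output; the `time_axis` here is just a stand-in for my real one:

```python
import numpy

time_axis = numpy.arange(6000)                 # stand-in for my real time axis
fake_chunk = numpy.random.rand(6000, 1, 1, 1)  # one grid point's chunk
print(my_polyfit(fake_chunk).shape)            # (4,)
```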
and this is the command I thought I'd need to apply it to each chunk:

```python
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
```
The idea is that the time axis is gone (hence `drop_axis=0`) and a new coefficient axis of length 4 appears in its place (hence `new_axis=0`).
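So, in terms of shapes, this is what I expect to come out the other end:

```python
# Input:  shape (6000, 31, 189, 192), chunks (6000, 1, 1, 1)
# Output: shape (4, 31, 189, 192),    chunks (4, 1, 1, 1)  -- coefficients replace time
```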
When I run this command I get `IndexError: tuple index out of range`, so I'm wondering where/how I've misunderstood the use of `map_blocks`?