
I'm trying to run custom numba vectorized/ufunc functions in a lazy dask pipeline.

When I run the code below I get `ValueError: Core dimension 'm' consists of multiple chunks`. I don't understand why `m` is considered a core dimension. Any idea how I can solve this issue?

import numpy as np
import dask.array as da
import numba
from numba import float64

# Define ufunc that directly takes a 3D array and mean reduce along axis 0
@numba.guvectorize([(float64[:,:,:], float64[:,:])], '(k,m,n)->(m,n)')
def reduce_mean(x, out):
    """Mean reduce a 3D array along the first dimension (axis 0)"""
    nrows = x.shape[0]
    for idx in range(x.shape[1]):
        for idy in range(x.shape[2]):
            col_sum = np.sum(x[:,idx,idy])
            out[idx,idy] = np.divide(col_sum, nrows)

# Apply ufunc on dask array
arr = da.random.random((10,200,200), chunks=(10,50,50)).astype(np.float64)
arr_reduced = reduce_mean(arr)
print(arr_reduced)
Loïc Dutrieux
    I get the same thing running this with the `da.as_gufunc` - so it's not numba tripping over dask's map operation. this has me stumped too. – Michael Delgado Mar 18 '22 at 17:46
  • It looks like you need to have only 1 chunk: if you fix the m dimension, then it complains about n, and then k. You can use `map_blocks` instead. – Jérôme Richard Mar 18 '22 at 19:15

1 Answer


As Jérôme mentioned, a gufunc requires each of its core dimensions to lie in a single chunk — and every dimension named in the signature `(k,m,n)->(m,n)` is a core dimension, which is why `m` (and then `n` and `k`) triggers the error.
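As an illustrative aside (plain NumPy, no dask): the signature semantics mean the wrapped function always receives whole `(k,m,n)` blocks, so none of those axes can be split across chunks.

```python
import numpy as np

# With a gufunc-style signature, np.vectorize passes the ENTIRE (k,m,n)
# core block to the function — there are no loop dimensions left to
# parallelize over, which is why dask would need each core dimension
# in a single chunk.
f = np.vectorize(lambda x: x.mean(axis=0), signature="(k,m,n)->(m,n)")

x = np.arange(24.0).reshape(2, 3, 4)
result = f(x)          # shape (3, 4): k was reduced away
assert np.allclose(result, x.mean(axis=0))
```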

Dask's `da.as_gufunc` decorator has an `allow_rechunk` option that can help:

import numpy as np
import dask.array as da

# Define ufunc that directly takes a 3D array and mean reduce along axis 0
@da.as_gufunc(signature="(k,m,n)->(m,n)", output_dtypes=float, vectorize=True, allow_rechunk=True)
def reduce_mean(x):
    """Mean reduce a 3D array along the first dimension (axis 0)"""
    nrows = x.shape[0]
    out = np.empty((x.shape[1], x.shape[2]))
    for idx in range(x.shape[1]):
        for idy in range(x.shape[2]):
            col_sum = np.sum(x[:,idx,idy])
            out[idx,idy] = np.divide(col_sum, nrows)
    return out

# Apply ufunc on dask array
arr = da.random.random((10,200,200), chunks=(10,50,50)).astype(np.float64)
arr_reduced = reduce_mean(arr)
arr_reduced.compute()

I couldn't find an equivalent option for the numba-compiled gufunc, though, so, as Jérôme says, `map_blocks` might be the only option:

import numpy as np
import dask.array as da
import numba
from numba import float64

# Define ufunc that directly takes a 3D array and mean reduce along axis 0
@numba.guvectorize([(float64[:,:,:], float64[:,:])], '(k,m,n)->(m,n)')
def reduce_mean(x, out):
    """Mean reduce a 3D array along the first dimension (axis 0)"""
    nrows = x.shape[0]
    for idx in range(x.shape[1]):
        for idy in range(x.shape[2]):
            col_sum = np.sum(x[:,idx,idy])
            out[idx,idy] = np.divide(col_sum, nrows)

# Apply ufunc on dask array
arr = da.random.random((10,200,200), chunks=(10,50,50)).astype(np.float64)
arr_reduced = arr.map_blocks(reduce_mean, chunks=(50, 50), drop_axis=0)
arr_reduced.compute()
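As a sanity check (my addition, not part of the original answer), the gufunc body is just a mean over axis 0, so a pure-NumPy reference implementation with the same explicit loops should agree with `np.mean`:

```python
import numpy as np

def reduce_mean_ref(x):
    """Reference: mean-reduce a 3D array along axis 0 with explicit
    loops, mirroring the body of the numba gufunc above."""
    nrows = x.shape[0]
    out = np.empty((x.shape[1], x.shape[2]))
    for idx in range(x.shape[1]):
        for idy in range(x.shape[2]):
            out[idx, idy] = np.sum(x[:, idx, idy]) / nrows
    return out

rng = np.random.default_rng(0)
x = rng.random((10, 20, 20))
assert np.allclose(reduce_mean_ref(x), x.mean(axis=0))
```

The same comparison against `arr.mean(axis=0).compute()` can be used to verify the dask `map_blocks` pipeline end to end.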
pavithraes