
I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.

function to apply to each chunk:

def get_polys(labeled_blocks):
    # Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    return polys

line of code trying to get dask to do this:

test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()

where labeled_arr is the input chunked dask array.

Running this as-is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype then raises an AttributeError, since the output list has no dtype attribute. I discovered the meta keyword, but still haven't been able to get the right syntax to turn my output into a Series or list.

I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.

Jessica

2 Answers


This might work:

import dask
import dask.array as da

# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))

@dask.delayed
def my_func(block):
    # do something fancy
    return list(block)

results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])

As you noted, the problem is that a list has no dtype. One way around this would be to convert the list into a np.array, but I'm not sure this will work for all geometry objects (it should be OK for Points, but Polygons might be problematic due to their varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects, feeding them into your function one at a time (but scaled across workers/processes).
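For reference, here's a minimal sketch (reusing the toy array above, with a placeholder block function) showing that `dask.compute` returns a tuple, which is why an extra unpacking/flattening step is needed to get one combined list:

```python
import dask
import dask.array as da

# toy array with 4 blocks, as in the snippet above
test_array = da.random.random((4, 4), chunks=(2, 2))

@dask.delayed
def my_func(block):
    # placeholder for per-block work (e.g. polygonizing a chunk)
    return list(block)

# dask.compute returns a tuple, so unpack its single element first
(results,) = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])

# then flatten the per-block lists into one combined list
flat = [item for sublist in results for item in sublist]
```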

SultanOrazbayev
  • Thanks @SultanOrazbayev! This approach worked great, though I did end up having to unnest a tuple of length one containing some nested lists. Unfortunately I ran into issues with the geospatial component, because I was trying to reattach coordinates in `my_func`, but I only have a transform computed for my entire dataset, not each dask chunk. I ended up with a solution modified from my initial post: – Jessica Feb 17 '21 at 15:40

Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.

poss_list = []
def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    poss_list.append(polys)

da.blockwise(get_polys, '', labeled_arr, 'ij',
             meta=pd.DataFrame({'c': []}), concatenate=True).compute()

If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).

Update - improved answer using dask.delayed, building on the accepted answer by @SultanOrazbayev

import dask
# onedem = original_xarray_dataarray
poss_list = []

@dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
    # pointer receives the task name from the block's key and isn't used here

    # Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs
    def getpx(chunkid, chunksz):
        # assumes uniform chunks (each chunk matches the first chunk's size)
        amin = chunkid[0] * chunksz[0][0]
        amax = amin + chunksz[0][0]
        bmin = chunkid[1] * chunksz[1][0]
        bmax = bmin + chunksz[1][0]
        return (amin, amax, bmin, bmax)

    # order of all inputs (and outputs) should be y, x when axis order is used
    chunksz = (onedem.chunks['y'], onedem.chunks['x'])
    ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)

    # use rasterio Windows and rioxarray to construct a per-chunk transform
    # https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
    chwindow = rasterio.windows.Window(xmini, ymini, xmaxi - xmini, ymaxi - ymini)
    trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)

    return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]


for obj in labeled_arr.to_delayed():
    for bl in obj:
        # get_bergs is already wrapped by @dask.delayed, so call it directly;
        # bl.key unpacks into (task name, chunk row index, chunk column index)
        piece = get_bergs(bl, *bl.key)
        poss_list.append(piece)

poss_list = dask.compute(*poss_list)

# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item) != 0]
Jessica
  • What happens here is you are modifying `poss_list`, which is external to `get_polys`, so it can't happen in parallel, which constrains dask. – SultanOrazbayev Feb 17 '21 at 15:50
  • Check section 'Avoid global state' in the best practices for this specific case: https://docs.dask.org/en/latest/delayed-best-practices.html – SultanOrazbayev Feb 17 '21 at 15:52
  • Thanks for the link - I'll look into it and see if I can come up with a best-practices solution that gets my geospatial info in correctly! – Jessica Feb 17 '21 at 15:58
  • One somewhat hacky way is to store this info in a file, and ask workers to load the file when necessary. – SultanOrazbayev Feb 17 '21 at 15:59
  • Yeah - I'm thinking that long term it would be cool to try and build some of this chunk-wise geospatial handling into rioxarray and related tools too. But your separate file solution would make a good stopgap. I suspect I could also generate chunk-wise transforms and keep them in my Xarray dataset containing the dask arrays. – Jessica Feb 17 '21 at 16:19
  • If I can help, feel free to ping me on GitHub. – SultanOrazbayev Feb 17 '21 at 16:22
  • I tried to implement the dask.delayed approach, but I've been struggling to figure out how to get the function (`my_func()` in the accepted answer) to identify which chunk it is on so that I can also feed it a corresponding geotransform. Neither the `block_id` nor `block_info` kwargs that are available for `map_blocks` and `blockwise` seem to be available, and if I try to use either of those functions I'm back at the original issue of having a non-array function return. – Jessica Feb 23 '21 at 02:53
  • Would it work for you if you knew the sequential order of each block? If yes, then you can use `enumerate` on the delayed objects, like this: `[my_func(x, n) for n, x in enumerate(test_array.to_delayed().ravel())]` – SultanOrazbayev Feb 23 '21 at 07:01
  • If you really need the block number/id, then you could also remove `.ravel()` in the comment above, that will generate a list of lists, so you can iterate over them with appropriate enumeration. – SultanOrazbayev Feb 23 '21 at 07:04
  • Thanks, @SultanOrazbayev! I've puzzled through using your suggestions and come up with a way to compute a per-chunk geotransform to apply to each raster chunk so that it produces vectors (polygons) in the correct crs (I'm posting the answer separately). – Jessica Feb 25 '21 at 15:13
  • The last three lines of the code should be outside the loop, right? Or maybe I'm reading it wrong. – SultanOrazbayev Feb 25 '21 at 15:56
  • Nice catch - the formatting got messed up when I copy-pasted the code and I missed fixing those indents! They should be all set now. – Jessica Feb 26 '21 at 19:25
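For completeness, the enumeration approach suggested in the comments (recovering each block's chunk indices by skipping `.ravel()`) can be sketched like this; `my_func` here is a placeholder for the per-chunk work:

```python
import dask
import dask.array as da

# toy array with a 2x2 grid of blocks
test_array = da.random.random((4, 4), chunks=(2, 2))

@dask.delayed
def my_func(block, i, j):
    # (i, j) are the chunk indices, available here to build a per-chunk transform
    return (i, j, float(block.sum()))

# to_delayed() without .ravel() keeps the 2x2 grid of delayed blocks,
# so enumerating rows and columns recovers each block's indices
tasks = [my_func(blk, i, j)
         for i, row in enumerate(test_array.to_delayed())
         for j, blk in enumerate(row)]
(results,) = dask.compute(tasks)
```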