
I work with geospatial images in TIFF format. Thanks to the rasterio library I can read these images as numpy arrays of shape (nb_bands, x, y). Here I manipulate an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).

My idea was to use np.unique to retrieve the information about these patches as follows:

import rasterio as rio
import numpy as np
import pandas as pd

# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)

class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)

src_flat = src_raster.flatten()
del src_raster

# value of the first pixel of each patch
values = [src_flat[index] for index in indices]

df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})

My problem is this: for an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same shape in int64 and I get the following error:

Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64

  • Is it normal that unique recasts my array to int64?
  • Is it possible to force it to use a more memory-efficient dtype? (Even if all my patches were 1 pixel, np.int32 would be sufficient.)
  • Is there another solution using a function I don't know?
Pierrick Rambaud
  • I think your best bet is probably to split your image into, say 16, horizontal strips and process them one at a time. The results are readily combinable afterwards. – Mark Setchell Dec 22 '20 at 09:34

3 Answers


The uint64 array is probably allocated during the argsort call inside the np.unique source code.
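The size matches the error message: 69940 × 70936 ≈ 4.96 × 10⁹ elements, and at 8 bytes per 64-bit element that is ≈ 37 GiB.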

Since the labels from scipy.ndimage.label are consecutive integers (0 for the background, then 1 … num_features) you can use numpy.bincount:

num_features = np.max(mask_raster)
# bincount needs a 1D array; ravel returns a view, so (unlike flatten)
# it does not copy the raster
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)

To get the values from src you can use the following assignment. It's really inefficient, but I don't think it allocates too much memory.

# one slot per label; each position of src_raster is written to the slot
# of its label, so every patch ends up with one of its src values
values = np.zeros(num_features + 1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
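To make the fancy-index assignment concrete, here is a tiny toy example (made-up values):

import numpy as np

mask = np.array([[1, 1], [2, 0]])  # labels, same shape as src
src = np.array([[7, 7], [9, 5]])   # source values
values = np.zeros(3, dtype=src.dtype)
values[mask] = src                 # element-wise: values[mask[i, j]] = src[i, j]
print(values)                      # [5 7 9]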

Maybe scipy.ndimage has a function that better suits the use case.
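For example, ndimage.sum and ndimage.labeled_comprehension compute per-label statistics without sorting the full array. A sketch, assuming labels run from 1 to num_features and each patch is homogeneous in src:

from scipy import ndimage
import numpy as np

index = np.arange(1, num_features + 1)

# pixel count per patch: sum a raster of ones over each label
count = ndimage.sum(np.ones_like(mask_raster), labels=mask_raster, index=index)

# one representative src value per patch
values = ndimage.labeled_comprehension(
    src_raster, mask_raster, index, lambda a: a[0], src_raster.dtype, 0)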

user7138814
  • Thank you for your answer, nice idea to use the `bincount` method, it works like a charm. Can you explain how the last line of code works? I didn't know it was possible to use `=` between 2 arrays of different dimensions (`values` and `src_raster`) – Pierrick Rambaud Jan 05 '21 at 16:44
  • @Pierrick The key thing is that `mask_raster` and `src_raster` have the same shape. – user7138814 Jan 05 '21 at 18:17
  • In order to use bincount, the array needs to be 1D. So I tried to use flatten and I got the same error: the flattened array is reallocated in int64 – Pierrick Rambaud Jan 06 '21 at 08:17

I think splitting the NumPy array into smaller chunks and yielding unique/count pairs per chunk would be a memory-efficient solution, as would changing the data type to int16 or similar.
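A minimal sketch of that idea, assuming a 2D array split into horizontal strips (counts are additive, so per-strip results can simply be summed):

import numpy as np

def unique_counts_chunked(arr, n_chunks=16):
    # hypothetical helper: accumulate unique-value counts strip by strip
    totals = {}
    for strip in np.array_split(arr, n_chunks, axis=0):
        vals, counts = np.unique(strip, return_counts=True)
        for v, c in zip(vals.tolist(), counts.tolist()):
            totals[v] = totals.get(v, 0) + c
    return totals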

Deniz
  • Thank you Deniz for your contribution. Look closely at my question: the image is already in int16 (84 MB on my disk); it's argsort that reallocates in int64. Splitting the image into smaller chunks is at the moment not an option because I cannot guarantee that I won't split in the middle of a patch, and your solution would force me to recombine the results afterwards. Also keep in mind that when you suggest answers, people like to see actual code and not only suggestions. – Pierrick Rambaud Jan 05 '21 at 07:01

I dug into the scipy.ndimage lib and indeed found a solution that avoids the memory explosion. Since it works on slices of the initial raster, it is also faster than I thought:

from scipy import ndimage
import numpy as np
import pandas as pd
import rasterio as rio

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use the patches as slicing material
indices = [i for i in range(1, np.max(mask_raster) + 1)]
counts = []
values = []
for i, loc in enumerate(ndimage.find_objects(mask_raster)):
    # pixels that belong to patch i+1 inside its bounding box
    patch = mask_raster[loc] == i + 1
    counts.append(np.count_nonzero(patch))

    # the value of the patch is the src value with the highest count
    loc_values, loc_counts = np.unique(src_raster[loc][patch], return_counts=True)
    values.append(loc_values[np.argmax(loc_counts)])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
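Note that ndimage.find_objects returns one bounding-box slice per label (from 1 to the maximum label), so each np.unique call only sees a small sub-array and no full-size int64 copy is ever allocated.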
Pierrick Rambaud