Efficiently mask an image with a label mask

Question

I have an image that I read in with tifffile.imread and it is turned into a 3D matrix, with the first dimension representing the Y coordinate, the second the X and the third the channel of the image (these images are not RGB and so there can be an arbitrary number of channels).

Each of these images has a label mask which is a 2D array that indicates the position of objects in the image. In the label mask, pixels that have a value of 0 do not belong to any object, pixels that have a value of 1 belong to the first object, pixels that have a value of 2 belong to the second object and so on.

What I would like to calculate is for each object and for each channel of the image I would like to know the mean, median, std, min and max of the channel. So, for example, I would like to know the mean, mediam std, min and max values of the first channel for pixels in object 10.

I have written code to do this but it is very slow (shown below) and I wondered if people had a better way or knew a package(s) that might be helpful i making this faster/doing this more efficiently. (Here the word 'stain' means the same as channel)

sample = imread(input_img)
label_mask = np.load(input_mask)

n_stains = sample.shape[2]
n_labels = np.max(label_mask)

#Create empty dataframe to store intensity measurements
intensity_measurements = pd.DataFrame(columns = ['sample', 'label', 'stain', 'mean', 'median', 'std', 'min', 'max'])

for label in range(1, n_labels+1):
    for stain in range(n_stains):
        #Extract stain and label
        stain_label = sample[:,:,stain][label_mask == label]

        #Calculate intensity measurements
        mean = np.mean(stain_label)
        median = np.median(stain_label)
        std = np.std(stain_label)
        min = np.min(stain_label)
        max = np.max(stain_label)

        #Add intensity measurements to dataframe
        intensity_measurements = intensity_measurements.append({'sample' : args.input_img, 'label': label, 'stain': stain, 'mean': mean, 'median': median, 'std': std, 'min': min, 'max': max}, ignore_index=True)

What are the dimensions of the images and number of channels and objects please? And how long does it take with your code? — Mark Setchell, Nov 07 '22 at 07:39
The inputs can be of any size, our images currently range between 128x128 and 150000x80000 and have any number of channels between 2 and 5. The current runtime on the large image with 3 channels and this code is >48hrs — Hugh Warden, Nov 07 '22 at 11:45
Just curious if you tried the method I proposed:) Especially on huge images. I guess it would need immense amount of memory. — Raibek, Nov 07 '22 at 12:36
As Python only uses a single core/thread, you could consider using multiprocessing as nearly all CPUs have 4 or more cores nowadays. Assign a label, or some other degree of granularity to each process. Or alternatively, consider coding the 5 lines (np.mean(), np.median() ... np.max()) using `numba` — Mark Setchell, Nov 07 '22 at 14:04
@Raibek I have run your algorithm but it seems to be taking noticable longer than the one in the original post. Unfortunately our cluster is having maintenance right now, but once it's back up I'm hoping to come back with some benchmarks — Hugh Warden, Nov 08 '22 at 21:08
Thanks for the feedback!:) Would be nice of you to share the benchmark results by methods if you'll have time for this of course. — Raibek, Nov 08 '22 at 23:21

score 1 · Answer 1 · answered Nov 07 '22 at 05:29

Your code is slow because you iterate over the whole image for each of the labels. This is an operation of O(n k), for n pixels and k labels. You could instead iterate over the image, and for each pixel examine the label, then update the measurements for that label with the pixel values. This is an operation of O(n). You'd keep an accumulator for each label and each measurement (standard deviation requires accumulating the square sum as well as the sum, but the sum you're already accumulating for the mean). The only measure that you cannot compute this way is the median, as it requires a partial sort of the full list of values.

This would obviously be a much cheaper operation, except for the fact that Python is a slow, interpreted language, and looping over each pixel in Python leads to a very slow program. In a compiled language you would implement it this way though.

See this answer for a way to implement this efficiently using NumPy functionality.

Using the DIPlib library (disclosure: I'm an author) you can apply the operation as follows (the median is not implemented). Other image processing libraries have similar functionality, though might not be as flexible with the number of channels.

import diplib as dip

# sample = imread(input_img)
# label_mask = np.load(input_mask)
# Alternative random data so that I can run the code for testing:
sample = imageio.imread("../images/trui_c.tif")
label_mask = np.random.randint(0, 20, sample.shape[:2], dtype=np.uint32)

sample = dip.Image(sample, tensor_axis=2)
msr = dip.MeasurementTool.Measure(label_mask, sample, features=["Mean", "StandardDeviation", "MinVal", "MaxVal"])
print(msr)

This prints out:

   |                                 Mean |                    StandardDeviation |                               MinVal |                               MaxVal |
-- | ------------------------------------ | ------------------------------------ | ------------------------------------ | ------------------------------------ |
   |      chan0 |      chan1 |      chan2 |      chan0 |      chan1 |      chan2 |      chan0 |      chan1 |      chan2 |      chan0 |      chan1 |      chan2 |
   |            |            |            |            |            |            |            |            |            |            |            |            |
-- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
 1 |      82.26 |      41.30 |      24.77 |      57.77 |      52.16 |      48.22 |      5.000 |      3.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 2 |      82.02 |      41.18 |      24.85 |      52.16 |      48.22 |      48.33 |      3.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 3 |      82.39 |      41.17 |      24.93 |      48.22 |      48.33 |      48.48 |      1.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 4 |      82.14 |      41.62 |      25.03 |      48.33 |      48.48 |      48.47 |      1.000 |      1.000 |      0.000 |      255.0 |      255.0 |      255.0 |
 5 |      82.89 |      41.45 |      24.94 |      48.48 |      48.47 |      48.54 |      1.000 |      0.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 6 |      82.83 |      41.60 |      25.26 |      48.47 |      48.54 |      48.65 |      0.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 7 |      81.95 |      41.77 |      25.51 |      48.54 |      48.65 |      48.22 |      1.000 |      1.000 |      2.000 |      255.0 |      255.0 |      255.0 |
 8 |      82.93 |      41.36 |      25.19 |      48.65 |      48.22 |      48.11 |      1.000 |      2.000 |      1.000 |      255.0 |      255.0 |      255.0 |
 9 |      81.88 |      41.70 |      25.07 |      48.22 |      48.11 |      47.69 |      2.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
10 |      81.46 |      41.40 |      24.82 |      48.11 |      47.69 |      48.32 |      1.000 |      1.000 |      2.000 |      255.0 |      255.0 |      255.0 |
11 |      81.33 |      40.98 |      24.76 |      47.69 |      48.32 |      48.85 |      1.000 |      2.000 |      1.000 |      255.0 |      255.0 |      255.0 |
12 |      82.30 |      41.55 |      25.12 |      48.32 |      48.85 |      48.75 |      2.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
13 |      82.43 |      41.50 |      25.15 |      48.85 |      48.75 |      48.89 |      1.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
14 |      83.29 |      42.11 |      25.65 |      48.75 |      48.89 |      48.32 |      1.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
15 |      83.20 |      41.64 |      25.28 |      48.89 |      48.32 |      48.13 |      1.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
16 |      81.51 |      40.92 |      24.76 |      48.32 |      48.13 |      48.73 |      1.000 |      1.000 |      1.000 |      255.0 |      255.0 |      255.0 |
17 |      81.81 |      41.31 |      24.71 |      48.13 |      48.73 |      48.49 |      1.000 |      1.000 |      0.000 |      255.0 |      255.0 |      255.0 |
18 |      83.58 |      41.85 |      25.25 |      48.73 |      48.49 |      32.20 |      1.000 |      0.000 |      1.000 |      255.0 |      255.0 |      212.0 |
19 |      82.12 |      41.24 |      25.06 |      48.49 |      32.20 |      24.44 |      0.000 |      1.000 |      1.000 |      255.0 |      212.0 |      145.0 |

I don't have an efficient solution for the median. You'd have to split the image into a separate array for each label, then run the median over that. This would be equally efficient as the above, but use up much more memory.

This looks really good thank you. I haven't managed to get diplib to work (I'll keep trying and if I fail I'll raise an issue) but the numpy_groupies seems to be working. I have another question about that but will raise it on that thread rather than this one. — Hugh Warden, Nov 07 '22 at 11:47
To clarify fr anyone trying to solve the same problem, I sped up my calcs using ng.aggregate from the link, I'm just asking on there whether I can calculate multiple functions per iteration — Hugh Warden, Nov 07 '22 at 11:58

Raibek · Answer 2 · 2022-11-07T10:08:53.853

The proposed method below utilizes matrix multiplications in order to speed up the calculations.
It is built on two crucial Numpy tools:

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html?highlight=einsum#numpy.einsum

Evaluates the Einstein summation convention on the operands.

https://numpy.org/doc/stable/reference/maskedarray.html

Masked arrays are arrays that may have missing or invalid entries. The numpy.ma module provides a nearly work-alike replacement for numpy that supports data arrays with masks.

masked array update: The initial code was updated with the masked array use after https://stackoverflow.com/users/7328782/cris-luengo spotted a mistake in my intial code.

This replaces all the non-selected pixels for a given label with a 0 value, and includes all those zeros into the measurements.

Now we mask the non-selected pixels before measurement calculations.

import numpy as np
import numpy.ma as ma
import pandas as pd

sample = imread(input_img)
label_mask = np.load(input_mask)

n_labels = np.max(label_mask)

# let's create boolean label masks for each label 
# producing 3D matrix where 1st axis is label
label_mask_unraveled = np.equal.outer(label_mask, np.arange(1, n_labels +1))

# now we can apply these boolean label masks simultaniously
# to all the sample channels with help of 'einsum' producing 4D matrix, 
# where the 1st axis is channel/stain and the 2nd axis is label
sample_label_masks_applied = np.einsum("ijk,ijl->klij", sample, label_mask_unraveled)

# in order to exclude the non-selected pixels 
# from meausurement calculations, we mask the pixels first
non_selected_pixels_mask = np.moveaxis(~label_mask_unraveled, -1, 0)[np.newaxis, :, :, :]
non_selected_pixels_mask = np.repeat(non_selected_pixels_mask, sample.shape[2], axis=0)

sample_label_masks_applied = ma.masked_array(sample_label_masks_applied, non_selected_pixels_mask)    

# intensity measurement calculations
# embedded into pd.DataFrame initialization
intensity_measurements = pd.DataFrame(
    {
        "sample": args.input_img,
        "label": sample.shape[2] * list(range(1, n_labels+1)),
        "stain": n_labels * list(range(sample.shape[2])),
        "mean": ma.mean(sample_label_masks_applied, axis=(2, 3)).flatten(),
        "median": ma.median(sample_label_masks_applied, axis=(2, 3)).flatten(),
        "std": ma.std(sample_label_masks_applied, axis=(2, 3)).flatten(),
        "min": ma.min(sample_label_masks_applied, axis=(2, 3)).flatten(),
        "max": ma.max(sample_label_masks_applied, axis=(2, 3)).flatten() 
    }
)

This replaces all the non-selected pixels for a given label with a 0 value, and includes all those zeros into the measurements. The min is always 0, unless there are negative values in the image. For the mean and std you could use this method, but computing the sum and the sum of squares instead, and also counting the number of values for each label. — Cris Luengo, Nov 07 '22 at 05:31

score 0 · Accepted Answer · answered Nov 23 '22 at 11:53

I've found a good solution that works for me using scikit image, specifically the regionprops functions.

import numpy as np
import pandas as pd
from skimage.measure import regionprops, regionprops_table
np.random.seed(42)

Here is a random "image" and label mask of that image

img = np.random.randint(0, 255, size=(100, 100, 3))
mask = np.zeros((100, 100)).astype(np.uint8)
mask[20:50, 20:50] = 1
mask[65:70, 65:70] = 2

There is already an inbuilt function for measuring the mean intensity for each channel that is very fast

pd.DataFrame(regionprops_table(mask, img, properties=['label', 'mean_intensity']))

You can also pass custom functions that take a binary mask and one channel of an intensity image to regionprops_table

def my_mean_func(mask, img):
    return np.mean(img[mask])

pd.DataFrame(regionprops_table(mask, img, properties=['label'], extra_properties=[my_mean_func]))

This is fast because the binary mask and intensity image passed to the custom function is the minimum bounding box of the mask. Therefore, the computations are much faster as they are operating over a much smaller area.

This only allows the user to calculate values per channel, but there is a generalisation that returns a 3D matrix of the selected region so that between channel measurements (or any measurements you like can be made).

props = regionprops(mask, img)

for prop in props:
    print("Region ", prop['label'], ":")
    print("Mean intensity: ", prop['mean_intensity'])
    print()

This is only an example of the very basic functionality.

I haven't had time to benchmark any of the above algorithms, but the ones used in this answer are very very fast indeed and I use them to operate over very large images quite quickly. However, it is important to note here that one of the reasons why this is so much faster for me is because I expect each object (each entry of the label mask that has the same value) to be only situated in a very small part of the image. Therefore, the minimum bounding box representation returned by regionprops is much much smaller than the original image and drastically speeds up computation.

Thank you very much to everyone for their help.

Efficiently mask an image with a label mask

3 Answers3