
I am analyzing histology tissue images stained with a specific protein marker, and I would like to identify the positive pixels for that marker. My problem is that thresholding the image gives too many false positives, which I'd like to exclude.

I am using color deconvolution (separate_stains from skimage.color) to get the AEC channel (corresponding to the red marker), separating it from the background (hematoxylin blue color), and applying OpenCV's Otsu thresholding to identify the positive pixels with cv2.threshold(blur,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU). But it also picks up the tissue boundaries (see the white lines in the example picture; sometimes they even have random colors other than white) and sometimes non-positive cells (blue regions in the example picture). It also misses some faint positive pixels which I'd like to capture.

Overall: (1) how do I filter out the false-positive tissue boundaries and blue pixels, and (2) how do I adjust the Otsu thresholding to capture the faint red positives?

Adding a revised example image -

  1. Top left: the original image after using HistoQC to identify tissue regions and applying the mask it identified, so that all non-tissue regions are black. I should try to adjust its parameters to exclude the folded tissue regions, which appear darker (towards the bottom left of this image). Suggestions for other tools to identify tissue regions are welcome.
  2. Top right: hematoxylin after the deconvolution
  3. Bottom left: AEC after the deconvolution
  4. Bottom right: Otsu thresholding applied on the original RGB image, trying to capture only the AEC-positive pixels but also showing false positives and false negatives

Thanks

AMM
  • So the 2nd picture is the unmixed red stain? What is that large light-gray area in the bottom right part? Looks artificial? How come there is so much unspecific staining? – Cris Luengo Apr 08 '20 at 02:56
  • I find the overlay not very helpful -- hard to see the blue, especially. Can you post the separate channels as grayscale images? – Paul Brodersen Apr 08 '20 at 10:56
  • I added the images to the post, I hope it makes more sense now. The big light gray is the background non-tissue area which I originally had as black before the thresholding after applying the tissue region mask identified by HistoQC. – AMM Apr 08 '20 at 14:07
  • You should do the deconvolution first, and apply the mask after. This will get rid of the artifacts you see at the edge of the tissue. – Cris Luengo Apr 08 '20 at 14:13
  • The other problem you have is that the stain colors are not estimated properly for the stain unmixing. This causes the region with dark hematoxylin to show up in the AEC channel. – Cris Luengo Apr 08 '20 at 14:15
  • That's a great suggestion, but it exposes the deconvolution to many other artifacts that the tissue mask is taking care of -- for example, this slide has two tissue segments separated from each other, and the deconvolution and thresholding applied on that gives all of the tissue and its region as positive because the space between tissues was darker (see AEC grayscale and original low-res whole slide images added to the post). There could be other artifacts in other slides which could drive some background to make all of the tissue positive or negative. – AMM Apr 08 '20 at 14:49
  • Why is applying the mask first causing these boundary issues? And how do you think either approach can resolve that? And regarding the unmixing problem, is there anything to do other than trying to exclude the folded tissue regions in the mask step? – AMM Apr 08 '20 at 14:50
  • The order should be deconvolution -> masking -> thresholding. The masking before deconvolution causes the edge artifacts because you are adding edges to the image that weren't there before. The masking makes the deconvolution worse, never better. – Cris Luengo Apr 13 '20 at 14:44
  • You should exclude the folded regions from your analysis. – Cris Luengo Apr 13 '20 at 14:46
  • Instead of blurring, I would highly recommend using an LoG or similar operator to enhance the blob detection part. Please refer to these very informative [slides](http://www.cse.psu.edu/~rtc12/CSE486/lecture11.pdf) – Baraa Apr 13 '20 at 18:19
  • For better channel separation, increase the image contrast. The best result will be if you make the blue pigment RGB(60, 56, 139) and the red one RGB(141, 57, 57); I calculated these values from the matrix hax_from_rgb. You can also calculate a matrix that is optimal for this image of yours. Removing the border can be done with a mask: a binarized source image followed by a 3-5 pixel morphological erosion operation. – Alex Alex Apr 14 '20 at 16:32

4 Answers


There are several issues that cause improper quantification. I'll go over the details of how I would recommend you tackle these slides.

I'm using DIPlib, because I'm most familiar with it (I'm an author). It has Python bindings, which I use here, and can be installed with pip install diplib. However, none of this is complicated image processing, and you should be able to do similar processing with other libraries.

Loading image

There is nothing special here, except that the image has strong JPEG compression artifacts, which can interfere with the stain unmixing. We help the process a bit by smoothing the image with a small Gaussian filter.

import diplib as dip
import numpy as np

image = dip.ImageRead('example.png')
image = dip.Gauss(image, [1]) # because of the severe JPEG compression artifacts

Stain unmixing

[Personal note: I find it unfortunate that Ruifrok and Johnston, the authors of the paper presenting the stain unmixing method, called it "deconvolution", since that term already had an established meaning in image processing, especially in combination with microscopy. I always refer to this as "stain unmixing", never "deconvolution".]

This should always be the first step in any attempt at quantifying from a brightfield image. There are three important RGB triplets that you need to determine here: the RGB value of the background (which is the brightness of the light source), and the RGB value of each of the stains. The unmixing process has two components:

First we apply the Beer-Lambert mapping. This mapping is non-linear. It converts the transmitted light (as recorded by the microscope) into absorbance values. Absorbance indicates how strongly each point on the slide absorbs light of the various wavelengths. The stains absorb light, and differ by the relative absorbance in each of the R, G and B channels of the camera.

background_intensity = [209, 208, 215]
image = dip.BeerLambertMapping(image, background_intensity)

I manually determined the background intensity, but you can automate that process quite well if you have whole slide images: there, the edges of the image always correspond to background, so you can look there for intensities.
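
A rough sketch of that automation (assuming image_array is the slide as an (H, W, 3) numpy array; the border width of 10 pixels is arbitrary):

import numpy as np

# collect pixels from a thin border around the image and take their per-channel median
border = np.concatenate([
    image_array[:10].reshape(-1, 3), image_array[-10:].reshape(-1, 3),
    image_array[:, :10].reshape(-1, 3), image_array[:, -10:].reshape(-1, 3),
])
background_intensity = np.median(border, axis=0)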

The second step is the actual unmixing. The mixing of absorbances is a linear process, so the unmixing amounts to solving a set of linear equations at each pixel. For this we need to know the absorbance values for each of the stains in each of the channels. Using standard values (as in skimage.color.hax_from_rgb) might give a good first approximation, but will rarely provide the best quantification.

Stain colors change from assay to assay (for example, hematoxylin has a different color depending on who made it, what tissue is stained, etc.), and change also depending on the camera used to image the slide (each model has different RGB filters). The best way to determine these colors is to prepare a slide for each stain, using all the same protocol but not putting on the other dyes. From these slides you can easily obtain stain colors that are valid for your assay and your slide scanner. This is however rarely if ever done in practice.

A more practical solution involves estimating the colors from the slide itself. By finding a spot on the slide where you see each of the stains individually (where stains are not mixed), one can manually determine fairly good values. It is possible to automatically determine appropriate values, but it is much more complex and it'll be hard to find an existing implementation. There are a few papers out there that show how to do this with non-negative matrix factorization with a sparsity constraint, which IMO is the best approach we have.

hematoxylin_color = np.array([0.2712, 0.2448, 0.1674])
hematoxylin_color = (hematoxylin_color/np.linalg.norm(hematoxylin_color)).tolist()
aec_color = np.array([0.2129, 0.2806, 0.4348])
aec_color = (aec_color/np.linalg.norm(aec_color)).tolist()
stains = dip.UnmixStains(image, [hematoxylin_color, aec_color])
stains = dip.ClipLow(stains, 0) # set negative values to 0
hematoxylin = stains.TensorElement(0)
aec = stains.TensorElement(1)

Note how the linear unmixing can lead to negative values. This is a result of incorrect color vectors, noise, JPEG artifacts, and things on the slide that absorb light that are not the two stains we defined.
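
To make the "set of linear equations at each pixel" concrete: with the two stain vectors as columns of a matrix, the unmixing is essentially a per-pixel least-squares solve. A numpy sketch of the same idea (not the exact DIPlib implementation; it assumes absorbance is the Beer-Lambert-mapped image as an (H, W, 3) numpy array):

import numpy as np

M = np.stack([hematoxylin_color, aec_color], axis=1)   # (3, 2): stain vectors as columns
flat = absorbance.reshape(-1, 3).T                      # (3, N): one absorbance triplet per pixel
coeffs, *_ = np.linalg.lstsq(M, flat, rcond=None)       # (2, N): stain amounts, can go negative
hematoxylin_np = coeffs[0].reshape(absorbance.shape[:2])
aec_np = coeffs[1].reshape(absorbance.shape[:2])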

Identifying tissue area

You already have a good method for this, which is applied to the original RGB image. However, don't apply the mask to the original image before doing the unmixing above, keep the mask as a separate image. I wrote the next bit of code that finds tissue area based on the hematoxylin stain. It's not very good, and it's not hard to improve it, but I didn't want to waste too much time here.

tissue = dip.MedianFilter(hematoxylin, dip.Kernel(5))
tissue = dip.Dilation(tissue, [20])
tissue = dip.Closing(tissue, [50])
area = tissue > 0.2

Identifying tissue folds

You were asking about this step too. Tissue folds typically appear as larger, darker regions in the image. It is not trivial to find an automatic method to identify them, because a lot of other things can create darker regions in the image too. Manual annotation is a good start; if you collect enough manually annotated examples you could train a deep learning model to help you out. I did this just as a placeholder; again, it's not very good, and it identifies some positive regions as folds. Folds are subtracted from the tissue area mask.

folds = dip.Gauss(hematoxylin - aec, [20])
area -= folds > 0.2

Identifying positive pixels

It is important to use a fixed threshold for this. Only a pathologist can tell you what the threshold should be, they are the gold-standard for what constitutes positive and negative.

Note that the slides must all have been prepared following the same protocol. In clinical settings this is relatively easy because the assays used are standardized and validated, and produce a known, limited variation in staining. In an experimental setting, where assays are less strictly controlled, you might see more variation in staining quality. You will even see variation in staining color, unfortunately. You can use automated thresholding methods to at least get some data out, but there will be biases that you cannot control. I don't think there is a way out: inconsistent stain in, inconsistent data out.

Using an image-content-based method such as Otsu causes the threshold to vary from sample to sample. For example, in samples with few positive pixels the threshold will be lower than in other samples, yielding a relative overestimation of the percent positive.

positive = aec > 0.1 # pick a threshold according to pathologist's idea what is positive and what is not

pp = 100 * dip.Count(dip.And(positive, area)) / dip.Count(area)
print("Percent positive:", pp)

I get 1.35% in this sample. Note that the % positive pixels is not necessarily related to the % positive cells, and should not be used as a substitute.

intermediate and final images produced by the code above

Cris Luengo
  • Thank you. Can you clarify the unmixing procedure elements (especially the need for dip.BeerLambertMapping which typically wasn't shown in the tutorials I've read, as well as the use of np.linalg.norm)? Am I seeing correctly that dense positive stain cell areas are counted as folds? – AMM Apr 20 '20 at 18:22
  • @Assafb: I recommend that you follow the link for Beer-Lambert that I included. I think it explains the process quite well. In short, stains mixing in the tissue is a linear process, but these stains absorb light, which is a non-linear process. The camera sees the light that is not absorbed. We need to go from that quantity to a quantity, absorbance, that is linear to the amount of stain present in the tissue. This is what the Beer-Lambert mapping function does. Once in the linear world, we can do the unmixing by solving a set of linear equations. – Cris Luengo Apr 20 '20 at 20:25
  • I don't know what tutorials you've read, but this is the core of the Ruifrok & Johnston paper, without that step you cannot do unmixing. I am assuming that `skimage.color.separate_stains` includes this step internally. The problem is that that function doesn't take the illumination intensity into account, which is most important to do the mapping correctly. – Cris Luengo Apr 20 '20 at 20:27
  • @Assafb: the normalization of the color vectors is important for the stain unmixing. The colors I measured (the constants given in the code) are not normalized, they are the values of a single pixel that I thought was representative for the color. By normalizing these color vectors you get better unmixing, otherwise if the magnitude of the vectors differs then you might introduce a bias in the unmixing. – Cris Luengo Apr 20 '20 at 20:29
  • So the linear unmixing, which identifies the coefficients explaining the combination of the supplied color matrix, needs to be applied to the quantity of absorbed light, which is the opposite of what the camera is seeing. Using Beer-Lambert law we can calculate that absorbed light as a function of the absorptivity of the material, optical path length and concentration (which are unknown here?) – AMM Apr 20 '20 at 22:25
  • No, the linear unmixing must be applied to the absorbance, which is not the same as the amount of absorbed light. The amount of absorbed light has a non-linear relationship to the absorbance, as shown by Beer and Lambert. Absorbance is a combination of absorptivity, path length and concentration, as you say, and we cannot tease those things apart, but we don't need to. Absorptivity is described by the color vector, it's a characteristic of the dye used. Path length and concentration together give the amount of stain present at a location on the slide. – Cris Luengo Apr 20 '20 at 22:29
  • I also looked into the skimage function separate stains in which they appear to perform matrix multiplication of the image with the stain matrix and apply power of 10 transformation. So this is lacking the step we're discussing here? https://github.com/scikit-image/skimage-experiments/blob/d7694bacb43f5246662ead4b1a7a3bd3d7a196f5/skimage/color/colorconv.py#L1352 – AMM Apr 20 '20 at 22:32
  • @Assafb: The negative `log10` transformation is the Beer-Lambert mapping. What I don't understand in that code is the line `rgb += 2`. I don't know where that comes from or why they use it. I'm 100% sure it is wrong. I guess they're trying to avoid `log10(0)`, but that should be done differently. I bet this is where your negative values come from, though. – Cris Luengo Apr 20 '20 at 22:37
  • @Assafb: Compare the implementation in skimage with what is described in [the original paper](http://helios.mi.parisdescartes.fr/~lomn/Data/2017/Color/Quantification_of_histochemical_staining.pdf). You'll see several errors in the implementation, most importantly the lack of a division by the max intensity. They should have used `-np.log10(rgb/255)` (assuming that 255 is the illumination intensity), rather than `-np.log10(rgb)`. – Cris Luengo Apr 20 '20 at 22:47
  • @Assafb: "Am I seeing correctly that dense positive stain cell areas are counted as folds?" -- yes, those regions also have stronger Hematoxylin staining, my primitive fold detection doesn't work well, don't copy it. It was just for illustrative purposes. It would take quite a bit of effort to create a good automated algorithm, as I mention in the post. Manually annotating out folds is still a common procedure. – Cris Luengo Apr 20 '20 at 22:50

@cris-luengo thank you for your input on scikit-image! I am one of the core developers, and based on @Assafb's input, we are trying to rewrite the code in color/colorconv (separate_stains).

@Assafb: The negative log10 transformation is the Beer-Lambert mapping. What I don't understand in that code is the line rgb += 2. I don't know where that comes from or why they use it. I'm 100% sure it is wrong. I guess they're trying to avoid log10(0), but that should be done differently. I bet this is where your negative values come from, though.

Yes, apparently (I am not the original author of this code) we use rgb += 2 to avoid log10(0). I checked Fiji's Colour Deconvolution plugin, and they add 1 to their input. I tested several input numbers, and ~2 got us closest to the desirable results.

@Assafb: Compare the implementation in skimage with what is described in the original paper. You'll see several errors in the implementation, most importantly the lack of a division by the max intensity. They should have used -np.log10(rgb/255) (assuming that 255 is the illumination intensity), rather than -np.log10(rgb).

Our input data is float; the max intensity in this case would be 1. I'd say that that's the reason we don't divide by something.
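
For reference, one way we could avoid log10(0) without the fixed rgb += 2 offset is to clip to a small epsilon and divide by the illumination intensity; a sketch, assuming float input scaled to [0, 1], not the final implementation:

import numpy as np

def beer_lambert(rgb, illumination=1.0, eps=1e-6):
    # clip away zeros before taking the log, and normalize by the illumination intensity
    return -np.log10(np.maximum(rgb, eps) / illumination)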

Besides that, I opened an issue on scikit-image to discuss these problems and to work out a solution. I have done some research already (I even checked DIPlib's documentation) and implemented a different version of that specific function. However, stains are not my main area of expertise, and we would be glad if you could help evaluate that code, and maybe point to a better solution. Thank you again for your help!

  • I’m glad to help. But next time please find me by email or through GitHub (my profile here links to my profile on GitHub, and my profile there has my email address). This is not an appropriate way to use Stack Overflow, answers are supposed to answer the question at the top, this is not a message board. I’ve subscribed to your PR and will read it in more detail later tonight. – Cris Luengo May 16 '20 at 00:25
  • It's actually all @CrisLuengo insightful input, I was just learning from him. Thanks Cris. – AMM May 16 '20 at 00:53

I ended up incorporating some of the feedback given above by Cris into the following, possibly unconventional, solution, for which I would appreciate feedback (on the specific questions below, but also general suggestions for improvement or more effective/accurate tools or strategies):

  1. Define (but do not apply yet) a tissue mask (HistoQC), after optimizing the HistoQC script to remove as much of the tissue folds as possible without removing normal tissue area
  2. Apply deconvolution on the original RGB image using hax_from_rgb
  3. Take the second channel, which should correspond to the red stain pixels, and subtract from it the third channel, which as far as I can see corresponds to the background non-red/blue pixels of the image. This step removes the high values in the second channel that show up because of tissue folds or other artifacts that weren't removed in the first step (what does the third channel correspond to? The green element of RGB?)
  4. Blur the adjusted image and threshold based on the median of the image plus 20 (semi-arbitrary, but it works. Are there better alternatives? Otsu doesn't work here at all)
  5. Apply the tissue regions mask on the thresholded image yielding only positive red/red-ish pixels without the non-tissue areas
  6. Count the % of positive pixels relative to the tissue mask area (a rough code sketch of these steps follows the list)
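
A rough sketch of steps 2-6 (rgb and tissue_mask are placeholders for the original RGB image and the boolean HistoQC mask; the parameters are the ones described above):

import cv2
import numpy as np
from skimage.color import separate_stains, hax_from_rgb

stains = separate_stains(rgb, hax_from_rgb)                  # step 2: hematoxylin, AEC, residual
aec = stains[:, :, 1] - stains[:, :, 2]                      # step 3: second channel minus third channel
aec = cv2.normalize(aec, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
blur = cv2.GaussianBlur(aec, (5, 5), 0)                      # step 4: blur, then median + 20 threshold
_, thresh = cv2.threshold(blur, np.median(blur) + 20, 255, cv2.THRESH_BINARY)
positive = np.logical_and(thresh > 0, tissue_mask)           # step 5: apply the tissue mask last
pct = 100 * positive.sum() / tissue_mask.sum()               # step 6: % positive of tissue area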

I have been trying to apply, as suggested above, the tissue mask on the deconvolution red channel output and then use Otsu thresholding. But it failed, since the black background generated by applying the tissue regions mask makes the Otsu threshold detect the entire tissue as positive. So I have proceeded instead to apply the threshold on the adjusted red channel and then apply the tissue mask before counting positive pixels. I am interested in learning what I am doing wrong here.
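
One thing I have not tried yet is computing the Otsu threshold only over the pixels inside the tissue mask, so the black background cannot drive it; a sketch, with aec and tissue_mask as above:

import numpy as np
from skimage.filters import threshold_otsu

# Otsu computed only on tissue pixels; the black background is excluded from the histogram
t = threshold_otsu(aec[tissue_mask])
positive = np.logical_and(aec > t, tissue_mask)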

Other than that, the LoG transformation didn't seem to work well because it produced a lot of stretched bright segments rather than just circular blobs where cells are located. I'm not sure why this is happening.

AMM
  • @CrisLuengo you advised to do 'deconvolution -> masking -> thresholding' and to 'exclude the folded regions'. I did apply the deconvolution before masking, but I had to apply the thresholding before the masking because (1) otherwise the black background will be the driver of the Otsu threshold, and (2) I don't have a method to completely exclude the folded regions with 100% accuracy, so I applied the subtraction mentioned above to clean up the remaining small folds from the deconvolved image. I clarified the description above to reflect this process better. What exactly is wrong in what I'm doing? – AMM Apr 18 '20 at 20:23
  • Am I missing a strategy to completely exclude folds even if they're small or light or alternatively make Otsu ignore the background or identify more than one threshold? – AMM Apr 18 '20 at 21:00
  • And do I need to normalize the images to have a comparable statistic of % positive pixels across different stains of different samples? – AMM Apr 18 '20 at 23:56
  • You should not normalize, and you should actually not use an automatic threshold method like Otsu. Instead, find the right threshold value and keep it fixed for all slides prepared in the same way (same assay). If the threshold depends on the image content, slides with higher % positive will get a different threshold than those with a low % positive, and you'll have a biased measurement of the % positive. – Cris Luengo Apr 19 '20 at 00:26
  • If you post the cropped tissue region without any processing (especially before applying the mask), then I can show you how to properly unmix, mask and threshold the image. – Cris Luengo Apr 19 '20 at 00:27
  • @CrisLuengo Here's an example of a tissue with tough properties (both light and dark folds, blurry areas and shadow around the bottom left). Ideally I would be able to measure the % positive out of the total usable tissue area (meaning the non folded or blurry areas). Thanks for your advice. https://www.dropbox.com/transfer/AAAAAG7iZf8Pw8eZVHrg8y6gOuIuTNIz44Zf7ZsusdJFfsU8EDd9aj4 – AMM Apr 19 '20 at 02:09
  • Also, regarding the normalization -- it makes sense that a threshold adjustable by the properties of the image could induce a bias, but how can we account for cases like 'lighter stain' because of experimental biases? – AMM Apr 19 '20 at 02:16
  • Sorry, I initially misread your post. I hope that my answer helps in some way. – Cris Luengo Apr 19 '20 at 08:07

Use ML for this case.

  1. Manually create a binary mask for your pictures: each red pixel white, background pixels black.
  2. Work in HSV or Lab color space.
  3. Train a simple classifier: a decision tree or an SVM (linear or with an RBF kernel).
  4. Let's test! (a minimal sketch follows the list)
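
A minimal sketch of steps 1-3 with scikit-learn (rgb, mask and new_rgb are placeholder arrays for a training image, its manual binary mask and a new image):

import cv2
import numpy as np
from sklearn.svm import SVC

# features: per-pixel HSV color; labels: 1 for red/positive pixels, 0 for background
hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
X = hsv.reshape(-1, 3).astype(np.float32)
y = mask.reshape(-1).astype(np.uint8)

clf = SVC(kernel='rbf', gamma='scale')      # or a DecisionTreeClassifier
clf.fit(X[::200], y[::200])                 # subsample pixels so training stays fast

hsv_new = cv2.cvtColor(new_rgb, cv2.COLOR_RGB2HSV)
pred = clf.predict(hsv_new.reshape(-1, 3).astype(np.float32)).reshape(new_rgb.shape[:2])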

See a good and very simple example with skin color segmentation.

And in the future you can add new examples and new cases without code refactoring: just update the dataset and retrain the model.

Nuzhny
  • Thanks, but I don't have the capacity to annotate this many (and huge) histopathology files – AMM Apr 16 '20 at 15:01
  • Why not? You can annotate one file, train a classifier and use it to annotate the second file. And you need annotations anyway for testing any other algorithm. – Nuzhny Apr 16 '20 at 19:16
  • Because I have tens of stain/tissue/cell types and distinct morphologies, and millions of cells on a slide – AMM Apr 18 '20 at 17:43
  • Let’s throw ML at every problem, even the ones that have been solved decades ago with simple algorithms and well-known linear algebra. That makes sense! – Cris Luengo Apr 18 '20 at 18:39
  • ML is also "decades-old simple algorithms and well-known linear algebra". SVM dates from 1992. ML is not magic; it helps to automatically get constants and thresholds for big data. – Nuzhny Apr 19 '20 at 06:09
  • @Assafb How do you advise testing your classic CV algorithm? You must annotate data and write tests. And when you get new images and your previous algorithm doesn't work, what will you do? Change some thresholds and constants? Are you sure that will still be good for the previous "million cells"? – Nuzhny Apr 19 '20 at 06:13