
I am currently working on a convolutional neural network for detecting pathological changes on x-ray images. It is a simple binary classification task. At the beginning of the project we gathered around 6000 x-rays and asked 3 different doctors (domain experts) to label them. Each of them got around 2000 randomly selected images (and those 3 sets were disjoint - each image was labelled by only one doctor).

After the labelling was finished I wanted to check how many cases per doctor were labelled as having or not having the changes, and this is what I got:

# A tibble: 3 x 3
  doctor `no_changes (%)` `changes (%)`
   <int>            <dbl>         <dbl>
1      1             15.9          84.1
2      2             54.1          45.9
3      3             17.8          82.2

From my perspective, if each of the doctors got a randomly sampled set of x-rays, the percentage of cases with and without changes should be roughly the same for each of them, assuming that they are "thinking similarly" - which clearly isn't the case here.

We talked with one of the doctors and he told us that it is entirely possible for one doctor to say that there are changes on an x-ray while another says the opposite, because typically they don't look at changes in a binary way - for example, the amount/size of the changes may decide the label, and each of the doctors could have a different cutoff in mind.

Knowing that, I started thinking about removing/centering the label bias. This is what I came up with:

  1. Because I know doctor 1 (let's say he is the best expert of the three), I decided to "move" the labels of doctors 2 and 3 in the direction of doctor 1.
  2. I gathered 300 new images and asked all 3 of them to label them (so this time each image was labelled by 3 different doctors). Then I checked the distribution of labels between doctor 1 and doctors 2/3. For example, for doctors 1 and 2 I got something like:
                      doctor2
doctor1               no_changes   changes   all
          no_changes          15         3    18
          changes            154       177   331
          all                169       180

From this I can see that doctor 2 had 169 cases that he labelled as not having changes, and doctor 1 agreed with him in only 15 of them. Knowing that, I changed the labels (probabilities) for doctor 2's no-changes cases from [1, 0] to [15/169, 1 - 15/169]. Similarly, doctor 2 had 180 cases with changes and doctor 1 agreed with him in 177 of them, so I changed the labels (probabilities) for doctor 2's changes cases from [0, 1] to [1 - 177/180, 177/180]. (A sketch of this step follows the list below.)

  3. Do the same thing for doctor 3.
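A rough sketch of the relabelling step in Python (illustrative only, not code from the project; the counts come from the doctor 1 vs doctor 2 table above):

# Agreement counts between doctor 1 and doctor 2 on the shared images (from the table above).
agree_no_changes = 15     # doctor 2 said "no changes", doctor 1 agreed
total_no_changes = 169    # all cases doctor 2 labelled as "no changes"
agree_changes = 177       # doctor 2 said "changes", doctor 1 agreed
total_changes = 180       # all cases doctor 2 labelled as "changes"

def soft_label(doc2_label):
    """Replace doctor 2's hard label with a soft P(changes = 1) estimated against doctor 1."""
    if doc2_label == 1:
        return agree_changes / total_changes           # 177/180 ~ 0.983
    return 1 - agree_no_changes / total_no_changes     # 1 - 15/169 ~ 0.911

# A "no changes" label (0) becomes a soft target of ~0.911, a "changes" label (1) becomes ~0.983.
print(soft_label(0), soft_label(1))

The same two ratios would be computed for doctor 3 from his own table against doctor 1.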

Having done that, I retrained the neural network with a cross-entropy loss.
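A minimal sketch of how such soft targets can be fed into a binary cross-entropy loss (PyTorch; the tiny model and toy batch below are placeholders, not the actual CNN or data):

import torch
import torch.nn as nn

# Placeholder model returning one logit per image; the real project would use a CNN here.
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1]

def train_step(images, soft_targets):
    """One optimisation step with soft labels such as 0.911 / 0.983 instead of hard 0 / 1."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = criterion(logits, soft_targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 4 random single-channel "images" with soft targets derived as above.
images = torch.randn(4, 1, 224, 224)
targets = torch.tensor([0.911, 0.983, 0.983, 0.911])
print(train_step(images, targets))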

My question is: is my solution correct, or should I do something differently? Are there any other solutions to this problem?

Maju116
  • Is it strictly binary classification and `changes` probability is always `1 - no_changes` probability? If so, you could just maintain either of those values. It doesn't change the results but simplifies the reasoning. – pkubik Mar 20 '20 at 23:17

1 Answer


It looks correct.

With cross-entropy you actually compare the probability distribution output by your model with some reference probability P(changes = 1). In binary classification we usually assume that our training data follow the empirical distribution, which yields either 1.0 or 0.0 depending on the label. As you already noted, this does not have to be the case, e.g. when we do not have full confidence in our labels.
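As a quick numeric illustration (a sketch with a made-up model output of 0.9), the binary cross-entropy against a soft reference p is the usual formula with p no longer restricted to 0 or 1:

import math

def binary_cross_entropy(p_ref, q_model):
    """Cross-entropy between the reference P(changes = 1) = p_ref and the model output q_model."""
    return -(p_ref * math.log(q_model) + (1 - p_ref) * math.log(1 - q_model))

print(binary_cross_entropy(1.0, 0.9))        # hard label: ~0.105
print(binary_cross_entropy(177 / 180, 0.9))  # soft reference from the question: ~0.142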

You can express your reference probability as:

P(changes = 1) = P(changes = 1, doc_k = 0) + P(changes = 1, doc_k = 1)

We just marginalize over all possible decisions of the k-th doctor. It is similar for P(changes = 0). Each joint distribution can be further expanded:

P(changes = 1, doc_k = L) = P(changes = 1 | doc_k = L) P(doc_k = L)

The conditional is a constant that you are computing by comparing each doctor with the oracle doctor 1. I cannot think of a better way to approximate this probability given the data you have. (You could, however, try to improve it with some additional annotations.) The P(doc_k = L) probability is just 0 or 1, because we know for sure which annotation has been given by each doctor.

All those expansions match your solution. For an example with no changes detected by the 2nd doctor:

P(changes = 0) = P(changes = 0 | doc_2 = 0) * 1 + 0 = 15/169

and for an example with changes:

P(changes = 1) = 0 + P(changes = 1 | doc_2 = 1) * 1 = 177/180

In both cases the constants 0 and 1 come from the value of the probability P(doc_2 = L).
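A small numeric sketch of that expansion for doctor 2, using the counts from the question:

# P(changes = 1 | doc_2 = L), estimated from the confusion table in the question.
p_changes1_given_doc2 = {0: 154 / 169, 1: 177 / 180}

def p_changes1(doc2_label):
    """Marginalise over L; P(doc_2 = L) is 1 for the observed label and 0 otherwise."""
    return sum(p_changes1_given_doc2[L] * (1.0 if L == doc2_label else 0.0) for L in (0, 1))

print(p_changes1(0))  # 154/169 ~ 0.911, i.e. 1 - 15/169
print(p_changes1(1))  # 177/180 ~ 0.983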

pkubik