6

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images, and every image is rated by more than 100 subjects. The mean score is calculated for each image and the scores are normalized.


The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model after assigning an appropriate weight to each class, since the data is imbalanced.

My question:

For this problem the scores are continuous, i.e. 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1. Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, given that most of the data points lie between 0.4 and 0.6?

Thanks!

AKSHAYAA VAIDYANATHAN

2 Answers

2

Since your labels are continuous, you could divide them into 10 equal quantiles using a technique like pandas.qcut() and assign a label to each bin. This turns the regression problem into a classification problem.
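
A minimal sketch of that binning step, assuming the normalized mean scores live in a hypothetical pandas Series called scores:

```python
import pandas as pd

# Hypothetical Series of normalized mean aesthetic scores, one per image.
scores = pd.Series([0.41, 0.55, 0.48, 0.62, 0.37, 0.51, 0.45, 0.58, 0.49, 0.53])

# qcut cuts the scores at their quantiles, so the 10 resulting classes
# contain (roughly) the same number of images.
class_labels = pd.qcut(scores, q=10, labels=False, duplicates="drop")
print(class_labels.value_counts().sort_index())
```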

As far as the imbalance is concerned, you may want to oversample the minority classes. This helps ensure your model is not biased towards the majority classes.
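
One simple way to oversample, sketched here with plain pandas and a hypothetical DataFrame df that has one row per image and its class label (libraries such as imbalanced-learn offer similar helpers):

```python
import pandas as pd

# Hypothetical frame: one row per image with its quantile-based class label.
df = pd.DataFrame({
    "image_id": range(10),
    "label":    [4, 4, 4, 4, 5, 5, 5, 3, 6, 6],
})

# Oversample every class (with replacement) up to the size of the largest class,
# so the classifier sees each class equally often during training.
max_count = df["label"].value_counts().max()
balanced = pd.concat(
    [grp.sample(max_count, replace=True, random_state=0)
     for _, grp in df.groupby("label")],
    ignore_index=True,
)
print(balanced["label"].value_counts())
```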

Hope this helps.

Sagar Dawda
  • applying qcut() sounds like an amazing idea. But I am not sure how far that will work for this problem. I will try it out and let you know about the performance of the classifier model. Thanks for the solution. – AKSHAYAA VAIDYANATHAN Apr 12 '18 at 09:57
  • Though the model gets confused between a few intermediate classes (maybe I should try reducing the number of classes), it is better than a regression model or a classification model trained on the dataset with Gaussian-distributed target values. Thank you again for the solution. – AKSHAYAA VAIDYANATHAN Apr 12 '18 at 11:09
  • 1
    My pleasure, Akshayaa. Yes, you could build a pipeline that tries 3 to 7 classes, record the results in a list, and pick the configuration that outperforms the others. – Sagar Dawda Apr 12 '18 at 14:44
0

I would recommend doing a histogram equalization over ALL of your participants' ratings first, so that the ratings are distributed equally.

Then, for each image in your training set, calculate the expected value (and, if you also want to, the variance). The expected value is just the mean of the votes. For the variance there are standard functions in (almost) every programming language that take an array of votes and return the variance.
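
For example, with NumPy (the votes array below is made up for illustration):

```python
import numpy as np

# Hypothetical ratings for a single image (the real data has 100+ votes per image).
votes = np.array([4, 5, 5, 6, 4, 5, 7, 5, 6, 5], dtype=float)

expected_value = votes.mean()   # the mean of the votes
variance = votes.var()          # population variance of the votes

print(expected_value, variance)
```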

Now take the expected value (and, if you want, also the variance) as the ground truth for your network.


EDIT: Histogram Equalization:

Histogram equalization is a method to use the given numerical range as efficiently as possible.

In the context of images, this would change the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255. Furthermore, every grayscale value gets distributed so that it occurs (on average) as often as every other. For your dataset you want the same, even though your values do not range from 0 to 255 but from 0 to 10. Also, you don't need to (and shouldn't) round the resulting values to integers. In this way, more frequently occurring votes are spread apart and less frequent votes are contracted.

Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
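
A rough sketch of one way to equalize the distribution of the expected values, using a rank transform (the mean_scores array is hypothetical):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical per-image expected values (mean scores).
mean_scores = np.array([0.45, 0.48, 0.50, 0.51, 0.52, 0.55, 0.60, 0.30, 0.49, 0.47])

# Map each score to its (average) rank, rescaled to [0, 1]. The transformed
# targets are approximately uniformly distributed, which is the
# histogram-equalization idea applied to the label distribution.
equalized = (rankdata(mean_scores, method="average") - 1) / (len(mean_scores) - 1)
print(equalized)
```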

This way the CNN should be able to better differentiate those small differences.

cagcoach
  • Thanks for your response. As far as I understand, histogram equalization is a method to adjust contrasts between images. How can this be applied to the distribution of user ratings? – AKSHAYAA VAIDYANATHAN Apr 11 '18 at 12:54