
I have a dataset consisting of some 17,000 images, each of size 3000x1700.

I have two methods for computing the mean and standard deviation of my dataset: a pixel-wise approach and an image-wise approach. The pixel-wise approach simply computes the mean and std of all the pixels in the dataset (using Welford's online algorithm, since there are far too many pixels to hold in memory at once). The image-wise approach computes the mean and variance of each image in the dataset, then averages them (and applies sqrt to the average variance to get the standard deviation).

In general, these two approaches will obviously not return the same results, since the image-wise approach gives equal weight to images of different sizes. However, all the images in my dataset are exactly the same size, so in my case the two algorithms should return the same mean and std values.
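For reference, here is a minimal single-value sketch of Welford's update (the textbook form that my batched RunningStatisticsVar class below generalises), run on a few made-up example values:

count, mean, m2 = 0, 0.0, 0.0
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):   # example values only
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)   # note: uses the already-updated mean
variance = m2 / count          # population variance (ddof=0); std is its sqrt

My full code is below.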

import numpy as np
import os
from PIL import Image


class RunningStatisticsVar:
    """Running mean/variance/std via a batched form of Welford's online algorithm."""

    def __init__(self, name='', ddof=0):
        self.name = name
        self.mean = 0
        self.var = 0
        self.std = 0

        self._n = 0       # number of values seen so far
        self._s = 0       # running sum of squared deviations (M2 in Welford's notation)
        self._ddof = ddof

    def update(self, values):
        values = np.array(values, ndmin=1)
        n = len(values)

        self._n += n

        # Shift the mean by the summed deviations from the old mean, then
        # accumulate the sum of squared deviations using the new mean.
        delta = values - self.mean
        self.mean += (delta / self._n).sum()

        self._s += (delta * (values - self.mean)).sum()
        self.var = self._s / (self._n - self._ddof) if self._n > self._ddof else 0
        self.std = np.sqrt(self.var)

    def __str__(self):
        if self.std:
            return f"{self.name} (\u03BC \u00B1 \u03C3): {self.mean} \u00B1 {self.std}"
        else:
            return f"{self.name}: {self.mean}"


def calculate_mean_and_std(source_dir, pixelwise=True):
    source_fs = [os.path.join(source_dir, f) for f in os.listdir(source_dir)]
    print("Pixelwise" if pixelwise else "Imagewise")

    if pixelwise:
        stats = {colour: RunningStatisticsVar(colour) for colour in 'rgb'}
        for source in source_fs:
            _process_image_pixelwise(source, stats)
            # Running statistics over every pixel seen so far.
            print(f"\u03BC: {[stats[colour].mean for colour in 'rgb']}")
            print(f"\u03C3: {[stats[colour].std for colour in 'rgb']}")

    else:
        means, vars = zip(*(_process_image(source) for source in source_fs))
        means = np.array(means)
        vars = np.array(vars)
        for i in range(len(means)):
            # Cumulative image-wise statistics over the first i + 1 images:
            # average the per-image means, and take the sqrt of the average
            # of the per-image variances.
            print(f"\u03BC: {[means[:i+1, c].mean() for c in range(3)]}")
            print(f"\u03C3: {[np.sqrt(vars[:i+1, c].mean()) for c in range(3)]}")


def _process_image(source_f):
    # Per-image statistics: mean and variance of each colour channel.
    img = np.array(Image.open(source_f))
    return img.mean(axis=(0, 1)), img.var(axis=(0, 1))


def _process_image_pixelwise(source_f, stats):
    # Feed every pixel of each colour channel into the running statistics.
    img = np.array(Image.open(source_f))
    for c, colour in enumerate('rgb'):
        stats[colour].update(img[:, :, c].flatten())


calculate_mean_and_std('/path/to/dataset', True)
calculate_mean_and_std('/path/to/dataset', False)

Now, when running this code, it starts off fine; after the first image it reports:

Pixelwise
μ: [106.0911049019608, 67.80728647058824, 45.90995117647062]
σ: [41.59660208236723, 34.5791272546266, 26.781512448936052]
Imagewise
μ: [106.09110490196079, 67.80728647058824, 45.909951176470585]
σ: [41.596602082409774, 34.579127254617084, 26.78151244893938]

There is only a slight discrepancy in the σ values. However, this error grows as more images are processed.

Pixelwise
μ: [101.21394647058824, 65.62210166666667, 46.48841911764708]
σ: [40.41893639932673, 33.3795015706431, 26.946594699203892]
μ: [101.21668875816994, 65.61455176470588, 45.89977104575165]
σ: [39.97502569826382, 33.022993336061994, 27.14311280813053]
μ: [102.36255480392157, 65.52340710784313, 45.7435418137255]
σ: [39.24577507415308, 32.14645170321761, 26.46386419015868]
μ: [103.53420298039215, 66.6776919607843, 47.33793435294118]
σ: [39.50097527065126, 32.707463761715545, 27.354952039420294]
μ: [101.47019483660131, 64.78871330065358, 45.96562905228758]
σ: [39.004169349180536, 32.23864799529698, 26.801524930594407]
μ: [101.37475316526611, 64.89486834733893, 45.781395238095236]
σ: [39.01205310849703, 31.98696402185754, 26.461034312514027]
μ: [100.67556311274511, 64.35973938725489, 45.18712225490196]
σ: [39.4077244026275, 32.09956875358911, 26.236057166690983]
μ: [99.85794368191722, 63.74056908496731, 44.48019496732026]
σ: [39.711590893165045, 32.16450280203762, 26.02186219040715]
μ: [99.51202813725492, 63.552036745098036, 44.69082580392157]
σ: [39.67742805536442, 32.360665575523704, 26.47325733239239]
μ: [99.68222427807488, 63.67247614973262, 44.77697739750446]
σ: [39.704289274013554, 32.212222956614625, 26.25452243516258]
Imagewise
μ: [101.21394647058824, 65.62210166666667, 46.488419117647055]
σ: [40.12360583609732, 33.30789834971034, 26.940384940174805]
μ: [101.21668875816994, 65.61455176470588, 45.899771045751635]
σ: [39.77618485504031, 32.97475731200615, 27.126232255358936]
μ: [102.36255480392157, 65.52340710784313, 45.74354181372549]
σ: [39.04354601878276, 32.108905824505584, 26.44949551028786]
μ: [103.53420298039217, 66.6776919607843, 47.33793435294118]
σ: [39.27047375795457, 32.596298448583816, 27.157260809967887]
μ: [101.47019483660131, 64.78871330065358, 45.96562905228759]
σ: [38.53431972571801, 31.865963477681383, 26.4560983662707]
μ: [101.37475316526611, 64.89486834733893, 45.78139523809524]
σ: [38.60904936868449, 31.66418210935895, 26.157487873733803]
μ: [100.6755631127451, 64.3597393872549, 45.18712225490196]
σ: [39.01506478558764, 31.78679779866631, 25.920704631785057]
μ: [99.85794368191722, 63.74056908496732, 44.48019496732026]
σ: [39.29746139951266, 31.839074510873296, 25.66162709110527]
μ: [99.5120281372549, 63.55203674509804, 44.69082580392156]
σ: [39.290881869376754, 32.064732326778504, 26.14723081778954]
μ: [99.68222427807487, 63.672476149732624, 44.77697739750445]
σ: [39.34959988736086, 31.939785095367306, 25.95437650452569]

and finally, after running the procedure on the entire dataset, the error becomes quite significant:

Pixelwise
μ: [101.8700454273955, 74.8429931459155, 64.76057496565133]
σ: [48.662522310391594, 41.730824001340146, 39.50453890881377]
Imagewise
μ: [101.87004542739516, 74.84299314591567, 64.76057496565105]
σ: [36.19880052359236, 32.38430878761532, 31.063098948011394]

As you can see, the means remain pretty much identical with both approaches even when the entire dataset is used. However, the standard deviation results are very far apart, even though they should be the same for both approaches. Which of my two approaches is at fault here and why?

  • https://en.wikipedia.org/wiki/Law_of_total_variance – Paul Panzer Nov 02 '19 at 01:06
  • I don't really understand what this means. I only have one random variable that I'm trying to compute the std for (the pixel value). What would my X be here? – Mate de Vita Nov 02 '19 at 13:13
  • Maybe a trivial example may help: Imagine each of your images is just a single color, but different colors for different images. Then your per picture method would yield std=0 because in each picture the variance is zero. Obviously this is wrong because it neglects across picture variance (the var(E) term in the law of total variance). – Paul Panzer Nov 02 '19 at 13:20
  • Huh, that's a good point. I was kind of following the answer given here: https://stats.stackexchange.com/a/26647/130181 . Can you explain the difference between the two problems? From what I can tell the two are almost identical, aside from changing months to images and MHw to pixel values. – Mate de Vita Nov 03 '19 at 16:06
  • The question there is not 100% clear. Either it is as you say the same as your problem, then the accepted answer is just wrong. Or they are interested in _summing_ not _pooling_ the data. If you are summing then just summing the variances is correct. – Paul Panzer Nov 03 '19 at 16:24
  • I should say summing the variances is correct if the variables you are summing are uncorrelated. Which for those windfarms seems a questionable assumption, I'd imagine there are correlations between the weather of consecutive months. – Paul Panzer Nov 03 '19 at 16:47
  • I've updated the code by changing `np.sqrt(vars[:i+1, c].mean())` to `np.sqrt(vars[:i+1, c].mean() + means[:i+1, c].var())` and it now seems to work correctly. Thanks! – Mate de Vita Nov 04 '19 at 13:24
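
A small self-contained check of that fix, as a sketch on made-up toy data (the names and sizes here are illustrative, not the actual images): by the law of total variance, the pooled variance of equally sized images equals the mean of the per-image variances plus the variance of the per-image means, so only the corrected image-wise formula matches the pixel-wise (pooled) result.

import numpy as np

rng = np.random.default_rng(0)
# Three equally sized single-channel "images" with different brightness levels.
images = [rng.normal(loc=mu, scale=10, size=(50, 50)) for mu in (40, 90, 140)]

pixels = np.concatenate([img.ravel() for img in images])
pooled_std = pixels.std()                                # pixel-wise (pooled) ground truth

means = np.array([img.mean() for img in images])
variances = np.array([img.var() for img in images])

naive_std = np.sqrt(variances.mean())                    # average of per-image variances only
corrected_std = np.sqrt(variances.mean() + means.var())  # + variance of the per-image means

print(pooled_std, naive_std, corrected_std)
# pooled_std and corrected_std agree (up to floating point); naive_std is far
# too small because it ignores the spread of the per-image means.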
