0

I have looked into how to calculate the weighted standard deviation or variance of a sample and found this very helpful post referencing the paper by Gatz and Smith that suggests several methods: https://math.stackexchange.com/questions/823125/sampling-error-with-weighted-mean

Now I am trying to understand how Matlab calculates the weighted variance when using "std(A,w)". I looked through the Matlab help and the original code using "edit var", but failed to understand the syntax of how the weighted variance is calculated. Could someone write down the equation for me or describe the crucial lines of below code in a few words? (you find the complete code of the function when you type "edit var" in Matlab).

% Weighted variance

else

if ~isvector(w) || ~isreal(w) || ~isfloat(w) || ...
   (omitnan && ~all(w(~isnan(w)) >= 0)) || (~omitnan && ~all(w >= 0))
    error(message('MATLAB:var:invalidWgts'));
end

if numel(w) ~= n
    if isscalar(w)
        error(message('MATLAB:var:invalidWgts'));
    else
        error(message('MATLAB:var:invalidSizeWgts'));
    end
end

if ~omitnan

    % Normalize W, and embed it in the right number of dims.  Then
    % replicate it out along the non-working dims to match X's size.
    wresize = ones(1,max(ndims(x),dim)); wresize(dim) = n;
    w = reshape(w ./ sum(w), wresize);
    y = sum(w .* abs(x  - sum(w .* x, dim)).^2, dim); % abs guarantees a real result

else
    % Repeat vector W, such that new W has the same size as X
    sz = size(x); sz(end+1:dim) = 1;
    wresize = ones(size(sz)); wresize(dim) = sz(dim);
    wtile = sz; wtile(dim) = 1;
    w = repmat(reshape(w, wresize), wtile);

    % Count up non-NaN weights at non-NaN elements
    w(isnan(x)) = NaN;
    denom = sum(w, dim, 'omitnan'); % contains no NaN, since w >= 0

    x = x - (sum(w .* x, dim, 'omitnan') ./ denom);
    wx2 = w .* abs(x).^2;
    y = sum(wx2, dim, 'omitnan') ./ denom; % abs guarantees a real result

    % Don't omit NaNs caused by computation (not missing data)
    ind = any(isnan(wx2) & ~isnan(w), dim);
    y(ind) = NaN;
LenaH
  • 313
  • 2
  • 14
  • I'm not sure this question is on-topic here. If you fail to understand the *mathematics* behind the function, then [math.se] would be the correct place to ask. If you fail to understand the *syntax* of MATLAB's implementation, please include a sample of the code in your question here and point out what you fail to grasp. – Adriaan Aug 20 '19 at 08:44
  • Good point - it's the syntax I don't get. See my edits above. – LenaH Aug 20 '19 at 11:33
  • 2
    The code is already explained by the comments, it's only the application of the mathematical formula of the weighted standard deviation. The code is a bit more complicated due to the fact that `x` can have an infinite number of dimension (but the standard deviation will only be applied along one dimension). Basically the code does that `sqrt(sum((x-mean(x)).^2.*w(:))/sum(w))` – obchardon Aug 20 '19 at 15:12
  • Thanks, that's the kind of explanation that I needed. If you put it in a reply, I'll accept it. – LenaH Aug 20 '19 at 20:33

1 Answers1

0

So I believe the equation mentioned earlier is slightly incorrect.

sqrt(sum((x-mean(x)).^2.*w(:))/sum(w)) % slightly off

It needs to be corrected by weighting the data based on it's length to match MATLAB's output of std(), and to match the base STD equation of:

sqrt(sum((x(:)-mean(x)).^2)/(length(x)-1)) % base STD equation

By accounting for the length we obtain the following:

sqrt(sum((x(:)-mean(x)).^2.*w(:))/sum(w)*length(x)/(length(x)-1)) % corrected

The example below demonstrates the differences:

% define weighted std functions
std_fcn_a = @(x,w) sqrt(sum((x(:)-mean(x)).^2.*w(:))/sum(w));
std_fcn_b = @(x,w) sqrt(sum((x(:)-mean(x)).^2.*w(:))/sum(w)*length(x)/(length(x)-1));

% create data
x = reshape(magic(3),[1,9]);    % create example data
w0 = ones(size(x))/length(x);   % create uniform weighting
rng(0);                         % seed random data
w1 = rand(size(x));             % create random weights
w1 = w1/sum(w1);                % normalize weights

% test unweighted
std_0 = std(x)                  % 2.7386
std_0_v2 = std(x,w0)            % 2.5820
std_a_0 = std_fcn_a(x,w0)       % 2.5820
std_b_0 = std_fcn_b(x,w0)       % 2.7386

% test weighted
std_a_1 = std_fcn_a(x,w1)       % 2.6963
std_b_1 = std_fcn_b(x,w1)       % 2.8599

Example output

std_0 = 2.7386
std_0_v2 = 2.5820
std_a_0 = 2.5820
std_b_0 = 2.7386
std_a_1 = 2.6963
std_b_1 = 2.8599

The unweighted test shows a strange result being that std(x) is not the same as std(x,w) if w is uniform. I initially assumed it should be since the std(x) would have a uniform weighting and would therefore be equivalent. This brings me the question myself and the validity of std(x,w).

sebslaman
  • 3
  • 3