0

I have four 1-D arrays of dependent variables. They contain hundreds of data points but I have cropped them to 20 in this example. Each point represents a grid cell on a map.

import numpy as np

A=np.asarray([0.195, 0.154, 0.208, 0.22, 0.204, 0.175, 0.184, 0.187, 0.171, 0.2, 0.222, 0.235, 0.206, 0.215, 0.222, 0.252, 0.269, 0.251, 0.285, 0.28])
B=np.asarray([0.119, 0.134, 0.132, 0.121, 0.11, 0.097, 0.13, 0.106, 0.103, 0.139, 0.124, 0.147, 0.152, 0.123, 0.177, 0.172, 0.18, 0.182, 0.197, 0.193])
C=np.asarray([0.11, 0.1, 0.103, 0.111, 0.105, 0.098, 0.099, 0.093, 0.105, 0.099, 0.113, 0.093, 0.104, 0.095, 0.099, 0.105, 0.108, 0.128, 0.125, 0.118])
D=np.asarray([-0.015, -0.015, -0.007, -0.02, 0.002, 0.009, 0.019, 0.0, -0.02, -0.001, -0.006, -0.015, -0.03, -0.036, -0.051, -0.058, -0.065, -0.081, -0.082, -0.055])

The error on each variable is contained within the arrays:

A_err=np.asarray([ 0.016,  0.015,  0.017,  0.016,  0.015,  0.016,  0.016,  0.018, 0.015,  0.014,  0.015,  0.016,  0.017,  0.016,  0.017,  0.017, 0.017,  0.017,  0.017,  0.017])
B_err=np.asarray([ 0.045,  0.049,  0.039,  0.044,  0.036,  0.027,  0.032,  0.033, 0.029,  0.036,  0.032,  0.027,  0.04 ,  0.022,  0.034,  0.026, 0.021,  0.028,  0.035,  0.028])
C_err=np.zeros(20)+0.7
D_err=np.zeros(20)+0.9

Y is the sum of A, B, C and D:

Y=A+B+C+D

Clearly Y will have a different value in each of the 20 grid cells. What I am trying to calculate is the error on each Y measurement. Initially I was calculating this simply as:

Y_err=np.sqrt(A_err**2 + B_err**2 + C_err**2 + D_err**2)

But this includes no covariance terms. Since I do not have expressions for the variables A, B, C and D in terms of one another, the only way I can see to calculate the covariances is to use the correlation-covariance matrix:

X = np.vstack([A,B,C,D])
C = np.cov(X)
print(C)
[[  1.31819737e-03   9.52921053e-04   2.17881579e-04  -8.58197368e-04]
[  9.52921053e-04   9.87252632e-04   1.69478947e-04  -7.97089474e-04]
[  2.17881579e-04   1.69478947e-04   9.47868421e-05  -1.88007895e-04]
[ -8.58197368e-04  -7.97089474e-04  -1.88007895e-04   8.85081579e-04]]

I am unclear as to whether I can then simply add each of the six non-diagonal matrix terms:

σ_AB=9.52921053e-04 
σ_AC=2.17881579e-04
σ_AD=-8.58197368e-04
σ_BC=1.69478947e-04
σ_BD=-7.97089474e-04
σ_CD=-1.88007895e-04 

into the equation for the propagation of error, such that:

Y_err=np.sqrt(A_err**2 + B_err**2 + C_err**2 + D_err**2 + 2*σ_AB + 2*σ_AC + 2*σ_AD + 2*σ_BC + 2*σ_BD + 2*σ_CD)

Or this totally invalid?

izzyrizzy
  • 43
  • 4
  • But I do not need the variances (diagonal terms) to calculate the error do I? – izzyrizzy Jul 10 '18 at 08:32
  • You definitely need the diagonal terms. These are the sigma(a)^2 like terms in your above formula. Both variance and covariance terms contribute to variance in Y according to your linear combination. – Mankind_008 Jul 10 '18 at 08:37
  • But then I will only have a single value for the error on Y. What I want is for Y to be a 20-point array, with a magnitude and error at each point. So rather than using the variances, I am using the values in the arrays A_err, B_err, C_err, D_err, i.e I am using sigma(a)^2 = (A_err[i])^2 etc. So the values for sigma(a), sigma(b), sigma(c), sigma(d) will vary whilst the values for the covariance terms are fixed. Is that wrong? – izzyrizzy Jul 10 '18 at 10:05
  • There is a confusion in your question. how did you exactly arrive with error values corresponding to each of 20 observations for eg. `A_err`, `B_err` for `A`, `B` ?? – Mankind_008 Jul 10 '18 at 19:03
  • They are the measurement errors which are recorded when I calculate the 4 variables. What A, B, C and D actually are is difficult to explain but let's say they are: 'amount of sunlight', 'amount of rain', 'soil porosity' and 'altitude', where Y is something like 'plant growth rate'. I have measured each of these 4 variables at various locations, but each time I measure there is an associated error, which is recorded in the A_err, B_err, C_err, D_err arrays. Then I am interested in the error on 'plant growth rate' at each location. Does that make sense? – izzyrizzy Jul 11 '18 at 13:54
  • I have posted to cross-validated as you suggested: https://stats.stackexchange.com/questions/355607/calculating-the-covariance-between-1-d-arrays-for-incorporation-into-propagation – izzyrizzy Jul 11 '18 at 14:04
  • okay. Thats good. You should get an answer from CV soon, wait for it. Your question is clearer now. I will also update my answer soon, see if it helps. – Mankind_008 Jul 11 '18 at 17:44

1 Answers1

0

This is what my understanding is of your measurements: if you had a multiple measurements for a single grid. Then you would have calculated std error in measurement from those observations including cross terms. Here you have a single measurement for each grid and error associated to that measurement. Since, you measured for all 20 grids using the same procedure, what you can do is to calculate std_err for your measurement procedure and add the same uncertainty to all of your 20 observations.

For modelling the uncertainties in error measurement (error terms only) i.e. standard error in Y_err. Covariance/ cross terms gives you an idea of how uncertainties/ error in measurement are related to each other (i.e. σ_A_err_B_err) and not how actual measurement variables relates to each other(i.e. σ_AB).

Instead of: X = np.vstack([A,B,C,D])

You need: X = np.vstack([A_err,B_err,C_err,D_err]), to calculate covariance in error terms.

Your formula should be: All of them contained in your covariance matrix (including diagonal terms): np.cov(np.vstack([A_err,B_err,C_err,D_err]))

Y_std_err= np.sqrt(σ_A_err**2 + σ_B_err**2 + σ_C_err**2 + σ_D_err**2 + 
                   2*σ_A_err_B_error + 2*σ_A_err_C_err + 2*σ_A_err_D_err + 
                   2*σ_B_err_C_err + 2*σ_B_err_D_err + 2*σ_C_err_D_err)

Also, you can directly multiply matrices to get the sample variance for y, instead of the summation formula.

enter image description here

Where S, here is the sample variance/ covariance matrix for error terms and c, coefficients of linear combination of measurement model.

X = np.vstack([A_err,B_err,C_err,D_err])        # matrix of error terms
cov = np.cov(X)
c = np.asarray([1, 1, 1, 1])                    # coefficients: linear combination 
var_y_err = np.matmul(c.T, np.matmul(cov,c))

np.sqrt(var_y_err)                              # standard error in measurement of Y: stdy_y_err
# 0.007379024325749306

Y values should simply be a sum of A,B,C,D and each array point having above (+ or -) Average Y_error (depending on direction of error), having its own standard error (+/-).

Y = A + B + C + D  (+ or -) Y_err_avg     # Y_err will be an interval for every grid:
Y_err_avg = sum(A_err + B_err + C_err + D_err)/ 20 + (+/-) std_y_err

PS: I am not an expert on propagation of error. I would recommend you to post your query on cross validated to get an expert opinion.

Mankind_008
  • 2,158
  • 2
  • 9
  • 15