Weighted mean of similar elements in a vector in R

Question

I have two vectors, x and w. w is a numeric vector of weights, the same length as x, giving the weight of each element of x.

I would like to replace elements of x whose differences are small (for example, below 1e-1 or 1e-2) with their weighted mean, in order to reduce the length of x. For example, the vectors are as follows:

    w =c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
         2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
         4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
         1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
         1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
         3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
         3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
         7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
         8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
         5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
         3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
         2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
         2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
         2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)

     x =c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)

I know how to sort the vector x together with its weights, but how can I identify the similar values in x and then get their weighted mean?

It is not really clear what you are asking especially considering that the e-01 and e-02 values are not small; the e-07 values are small. But: if you just want the weighted average go ahead and *multiply* x and w, then mean() the result. — Bernd Elkemann, Jun 07 '12 at 10:32
My interpretation was that the aim is to find "similar values in the vector x"... if these values are grouped together then this will decrease the length of the vector x. Intuitively, this might mean that two similar elements x1 and x2 having weights w1 and w2 get replaced by a compromise value x having weight w1+w2, such that the total weighted mean remains the same... — Tim P, Jun 07 '12 at 10:41
@TimP: Thanks for your comment. Your interpretation is correct when there are two similar values, but I may have more than two (for example, 3 or 4 values close to 0.4, and 3 or 4 values close to 1.7). I would like to find these similar values and then get the weighted mean of each group, with weight w1 + w2 + w3 + ... The code that you wrote finds pairs of similar values. What should I do when there are more than two similar values? — Bensor Beny, Jun 08 '12 at 01:06
Quick clarification: so you want to choose a value like 0.01 that represents two elements being "close", and then if x contains a sequence like (1.800, 2.000, 2.008, 2.016, 2.200) you'd group together elements 2, 3 and 4 since the adjacent distances (2.000 to 2.008, and 2.008 to 2.016) are both less than 0.01? Even though elements 2 and 4 are separated by more than 0.01 overall? — Tim P, Jun 08 '12 at 01:34
@TimP: Yes, that's exactly what I am looking for. I have written a function, as you can see above, but I do not know why the loop does not stop. — Bensor Beny, Jun 08 '12 at 01:39
There's lots of crazy stuff going on in that code - but conceptually it's certainly not checking the adjacent distances correctly (e.g. in my comment above, it would group 2.000 with 2.008, but wouldn't group with 2.016 because that's too far from 2.000). Take a look at the code example I've added to the top of my answer as I've tried to make it as clean as possible (it's quite a fiddly thing to code right). — Tim P, Jun 08 '12 at 02:21

Tim P · Accepted Answer · 2012-06-08T02:23:02.847

UPDATED ANSWER

How about something like this...? (See code below)

I called your original vectors origx and origw, so that the reordered ones are x and w. The code works on temporary copies of x and w (called xtemp and wtemp) which get destroyed, and builds up the new x and w (i.e. the "shorter" vectors you seek) in the variables xnew and wnew.

In simple terms, the code looks at xtemp and finds the first gap exceeding the threshold size (e.g. 0.05), and groups together all the elements from the start of xtemp running up to that "large" gap. (If there's no such gap it takes the whole of xtemp as a group.) The code then converts that group into a single weight called wgroup (the total of the group weights) and a single representative x value called xgroup (such that xgroup*wgroup is the same as the weighted sum of all the group elements). We then save xgroup and wgroup into the vectors xnew and wnew, wipe out the current group (by eliminating it from xtemp and wtemp), and then carry on in the same way till everything's been grouped.

Give it a test run and see what you think :)

origw = c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
          2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
          4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
          1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
          1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
          3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
          3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
          7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
          8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
          5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
          3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
          2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
          2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
          2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)

origx = c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)

reord = order(origx)
x = origx[reord]
w = origw[reord]

xnew = wnew = c()

thresh = 0.05
xtemp = x
wtemp = w
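while (length(xtemp) > 0) {
    # Index of the first gap larger than thresh (NA if there is none)
    nextgap = which(diff(xtemp) > thresh)[1]
    if (!is.na(nextgap)) {
        group = seq_len(nextgap)
    } else {
        group = seq_along(xtemp)
    }
    # Weighted mean and total weight of the current group
    xgroup = sum((xtemp*wtemp)[group])/sum(wtemp[group])
    wgroup = sum(wtemp[group])
    xnew = c(xnew, xgroup)
    wnew = c(wnew, wgroup)
    # Drop the group and continue with the remainder
    xtemp = xtemp[-group]
    wtemp = wtemp[-group]
}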
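
The same grouping can probably also be written without an explicit loop. The following is only a sketch under that assumption, not part of the original answer: the helper name collapse_close and the toy input are hypothetical. It labels each element with a group number using cumsum over the "large gap" flags, then sums within groups using tapply.

```r
# Hypothetical helper (an illustration, not the original answer's code):
# collapse runs of nearby x values into one weighted mean per run.
collapse_close <- function(x, w, thresh = 0.05) {
  stopifnot(length(x) == length(w))
  if (length(x) == 0) return(list(x = numeric(0), w = numeric(0)))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  # A new group starts wherever the gap to the previous element exceeds thresh
  grp <- cumsum(c(TRUE, diff(x) > thresh))
  wsum <- as.vector(tapply(w, grp, sum))       # total weight per group
  xwsum <- as.vector(tapply(x * w, grp, sum))  # weighted sum per group
  list(x = xwsum / wsum, w = wsum)
}

# Toy input based on the sequence discussed in the comments (weights made up)
res <- collapse_close(c(1.800, 2.000, 2.008, 2.016, 2.200),
                      c(1, 1, 2, 1, 1), thresh = 0.01)
res$x  # approximately 1.800 2.008 2.200
res$w  # 1 4 1
```

With the toy input, elements 2 to 4 are chained into one group because each adjacent gap is below 0.01, even though the first and last of them are further apart than that.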

OLD RESPONSE IS BELOW (superseded by the above...)

I'd suggest reordering x and w so that x is in strict numerical order, and then using the diff function:

reord = order(x)
x2 = x[reord]
w2 = w[reord]
which(diff(x2)<0.01)

The final command above indicates which elements in x2 (the sorted version of x) are within 0.01 of the next-highest element. The first value is 2 since elements 2 and 3 of x2 are such an example: x2[2]=0.4399527 and x2[3]=0.4429813.

Also, if you do

sort(diff(x2))

you can see all the differences arranged in numerical order, which might help you decide what a suitable cutoff should be.
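
As a rough illustration of how the cutoff affects the outcome, one could count how many groups each candidate cutoff would produce. This is a sketch on made-up toy data (not the vectors from the question), and the candidate cutoffs are arbitrary assumptions.

```r
# Toy data, already sorted (hypothetical values, not from the question)
x2 <- c(1.800, 2.000, 2.008, 2.016, 2.200, 2.230)
gaps <- diff(x2)

# Number of groups for a cutoff = number of gaps exceeding it, plus one
cutoffs <- c(0.005, 0.01, 0.05, 0.5)
n_groups <- sapply(cutoffs, function(th) sum(gaps > th) + 1)
n_groups  # 6 4 3 1
```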

No problem, give it a test run and let me know if it all works out ok :) — Tim P, Jun 08 '12 at 02:44
