
I'm struggling to adjust a function I wrote, and it's giving me a headache, so I thought I'd post it here.

In the function I'm using R's "by" function, which splits the data frame into subsets and runs a function on each of them.

Now I'm expanding the function to include weighted.mean (from the 'descr' package) and I'm getting an error that the lengths of x and w are not equal.

Some code to show the problem:

set.seed(100)
d1 <- rnorm(300)
d2 <- (floor(runif(100, min=1, max=4)))
weight <- rnorm(300,mean = 1, sd = 1)
df <- cbind.data.frame(d1,d2,weight)
df$d2 <- factor(df$d2,
                levels = c(1,2,3,4),
                labels = c("red", "blue", "green","orange")) 



require('descr')

by(df$d1, df$d2, function(x) mean(x=x, na.rm=TRUE))
by(df$d1, df$d2, function(x) weighted.mean(x=x, w=df$weight, na.rm=TRUE))

So I'm making a data frame with one numeric variable, one factor with 4 levels (only 3 of which have data, e.g. because of missing or filtered data), and a weight variable.

The 8th command is what I have now; it gives me the average per colour. It also returns NA for the levels of d2 that have no data, which is what I need. (I'm working on different sets of data and need to merge the results, so it's important that every defined level appears in the output.) Now I need to add the weight to it as well.

The 9th command (the one with weighted.mean in it) returns an error that the lengths of x and w differ. This is because by creates a subset of df$d1 for each level of df$d2, but the w passed to weighted.mean(x = x, w = df$weight, ...) is the entire variable, not just the part belonging to the subset.
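To make the mismatch concrete, here is a minimal sketch that compares the size of each subset with the length of the full weight vector. It rebuilds data like the example above, except that, as an assumption for illustration, d2 is drawn with 300 values instead of relying on recycling; the names group_sizes and msg are also just illustrative.

```r
set.seed(100)
# Illustrative data mirroring the question (assumption: 300 draws for d2)
df <- data.frame(
  d1     = rnorm(300),
  d2     = factor(floor(runif(300, min = 1, max = 4)),
                  levels = 1:4,
                  labels = c("red", "blue", "green", "orange")),
  weight = rnorm(300, mean = 1, sd = 1)
)

# Number of d1 values in each level of d2 (NA for the empty level)
group_sizes <- tapply(df$d1, df$d2, length)
print(group_sizes)

# The full weight vector is always as long as the whole data frame
print(length(df$weight))

# Pairing one subset of d1 with the full weight vector reproduces the error;
# tryCatch captures the message instead of stopping the script
msg <- tryCatch(
  weighted.mean(x = df$d1[df$d2 == "red"], w = df$weight, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Each subset is shorter than the 300 weights, so x and w can never line up unless the weights are subset in the same way as d1.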

I have been looking at the code of weighted.mean to see whether I can rewrite it, but haven't found a solution. Any help is welcome.

lmo

1 Answer


The trick is to pass the whole data.frame to by as input, so that it is split by the indices and each group still carries its own d1 and weight columns:

by(data = df, INDICES = df$d2, FUN = function(dfgroup) {
  weighted.mean(x = dfgroup$d1, w = dfgroup$weight, na.rm=TRUE)
})
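
As a quick sanity check, the sketch below runs the same call on data like the question's and compares one group against a weighted mean computed by hand. The data setup (with 300 draws for d2) and the names res, red and manual_red are assumptions for illustration only.

```r
set.seed(100)
# Illustrative data mirroring the question (assumption: 300 draws for d2)
df <- data.frame(
  d1     = rnorm(300),
  d2     = factor(floor(runif(300, min = 1, max = 4)),
                  levels = 1:4,
                  labels = c("red", "blue", "green", "orange")),
  weight = rnorm(300, mean = 1, sd = 1)
)

# Weighted mean per level: each group is a data frame with its own weights
res <- by(data = df, INDICES = df$d2, FUN = function(dfgroup) {
  weighted.mean(x = dfgroup$d1, w = dfgroup$weight, na.rm = TRUE)
})
print(res)

# Hand-computed weighted mean for one level, for comparison
red <- df[df$d2 == "red", ]
manual_red <- sum(red$d1 * red$weight) / sum(red$weight)
print(all.equal(as.numeric(res[["red"]]), manual_red))

# The level without data is still present in the result
print(is.na(res[["orange"]]))
```

Because x and w come from the same rows of dfgroup, their lengths always match, and the empty level is kept in the output as a missing value.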
Drey