0

I have the following dataset:

# A tibble: 450 x 546
   matchcode idstd year  country wt  region income industry sector ownership exporter c201  c202  c203a c203b c203c c203d c2041 c2042 c205a c205b1 c205b2 c205b3 c205b4 c205b5 c205b6 c205b7
   <chr+lbl> <dbl> <dbl> <chr+l> <dbl> <dbl+> <dbl+> <dbl+lb> <dbl+> <dbl+lbl> <dbl+lb> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+> <dbl+>
 1 "BGD 200~  2474 2002  Bangla~ 0.7    6      1       3       1      2         1        1994  2     100   0      0    NA     2    NA     NA   NA     NA     NA     NA     NA     NA     NA    
 2 "BGD 200~  2717 2002  Bangla~ 0.9    6      1       2       1      2         2        1986  4     100   0      0    NA     2    NA     NA   NA     NA     NA     NA     NA     NA     NA    
 3 "BGD 200~  2410 2002  Bangla~  NA    6      1       3       1      2         1        1999  4     100   0      0    NA     2    NA     NA   NA     NA     NA     NA     NA     NA     NA    
 4 "BRA 200~ 14917 2003  Brazil~  NA    4      2       8       1      2         2        1984  2     100   0      0     0     2    NA     50    1     NA     NA     NA     NA     NA     NA    
 5 "BRA 200~ 14546 2003  Brazil~ 1.1    4      2       2       1      2         2        1976  2     100   0      0     0     2    NA     50    1     NA     NA     NA     NA     NA     NA    
 6 "BRA 200~ 14709 2003  Brazil~  NA    4      2       3       1      2         2        1990  2     100   0      0     0     2    NA    100   NA      1     NA     NA     NA     NA     NA    
 7 "KHM 200~ 16475 2003  Cambod~  NA    2      1      20       2      2         2        1999  2     100   0      0     0     2    NA    100   NA     NA     NA      1     NA     NA     NA    
 8 "KHM 200~ 16298 2003  Cambod~  NA    2      1       4       3      2         2        1993  4     100   0      0     0     2    NA    100    1     NA     NA     NA     NA     NA     NA    
 9 "KHM 200~ 16036 2003  Cambod~ 0.5    2      1      21       2      2         2        1997  2     100   0      0     0     2    NA    100   NA      1     NA     NA     NA     NA     NA    
10 "CHN 200~ 17862 2002  China2~ 1.2    2      2      18       2      2         2        1993  3      49   0     51    NA    NA    NA     NA   NA     NA     NA     NA     NA     NA     NA    

and I am using the following data.table solution to create country level data from observational data:

cols = sapply(df, is.numeric) # 
cols = names(cols)[cols]
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]

Although the code works well, my dataset has weights for some observations, which I want to incorporate into my code. I have been thinking about how to do this but I cannot figure it out. Would it be possible to do write a function to add to the data.table solution? Something like:

dfclevel = df[, lapply(.SD, wfunc, na.rm=TRUE), .SDcols = cols,]

wfunc <- function(x,y)  # x = df, y=weights
for (i in nrow(df$weights) {
  if (df$weights[i] == !is.na){
    df[,i] <- df[,i]*df$weights[i]
  }

Or maybe I am even overthinking this?

EDIT: Based on the comments below I tried:

dfclevel= df[, lapply(.SD, weighted.mean(x, as.vector(wt), na.rm=TRUE), na.rm=TRUE), .SDcols = cols, by=matchcode]

weighted.mean

It gives me the error:

Error in weighted.mean(x, as.vector(wt), na.rm = TRUE) : 
  object 'x' not found

How do I specify that x should be a column of the df?

I tried this but it did not work:

dfclevel= df[, lapply(.SD, lapply(weighted.mean(x, as.vector(wt), na.rm=TRUE)), na.rm=TRUE), .SDcols = cols, by=matchcode] 
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Tom
  • 2,173
  • 1
  • 17
  • 44
  • 1
    do you mean `weighted.mean`? – chinsoon12 Sep 13 '18 at 10:12
  • That is way easier than anything I could have hoped for. It becomes `dfclevel = df[, lapply(.SD, weighted.mean(x, df$weight, na.rm=TRUE), na.rm=TRUE), .SDcols = cols, by=matchcode]` then right? – Tom Sep 13 '18 at 10:18
  • 1
    maybe u don't need the `df$` inside `weighted.mean`? – chinsoon12 Sep 13 '18 at 10:23
  • @chinsoon12 Thank you for your comments! I think I got quite a bit further (I updated my post), but I am still not completely there. Would you mind taking another look? – Tom Sep 13 '18 at 11:16
  • 5
    `lapply(.SD, weighted.mean, wt, na.rm = TRUE)` – Hugh Sep 13 '18 at 11:33
  • Thanks that worked! I guess I was still making it a little bit too complicated. – Tom Sep 13 '18 at 11:44
  • @Hugh Is it possible to add either a variance or sd in one go as well? – Tom Sep 14 '18 at 07:57
  • Not quite so pithily: `lapply(.SD, function(x) list(weighted.mean(x, wt), var(x), sd(x)))` – Hugh Sep 14 '18 at 08:01

0 Answers0