1

I have these two datasets:

   Week Year     value   min_date
1     5 2019 0.2544485 2019-01-29
2     6 2019 0.4517341 2019-02-05
3     7 2019 0.3850368 2019-02-12
4     8 2019 0.2638608 2019-02-19
5     9 2019 0.4701723 2019-02-26
6    10 2019 0.4299216 2019-03-05
7    11 2019 0.3539982 2019-03-12
8    12 2019 0.3121920 2019-03-19
9    13 2019 0.4851671 2019-03-26
10   14 2019 0.3682839 2019-04-02
11   15 2019 0.3118504 2019-04-09
12   16 2019 0.5535208 2019-04-16
13   17 2019 0.5439670 2019-04-23
14   18 2019 0.2279846 2019-04-30
15   19 2019 0.9141420 2019-05-07

   open_window close_window      ID Week
1   2019-02-02   2019-04-06  19-100    5
2   2009-03-04   2009-05-06  09-414    9
3   2004-03-06   2004-05-08  04-334   10
4   2004-02-24   2004-04-27  04-076    8
5   2007-02-16   2007-04-20  07-603    7
6   2012-12-25   2013-02-26  13-706   52
7   2018-03-31   2018-06-02 18-1197   13
8   2008-03-03   2008-05-05  08-415    9
9   2008-02-01   2008-04-04  08-922    5
10  2007-03-14   2007-05-16  07-292   11

and I would like to add a column to the second dataset containing the average of the values from the first dataset whose dates fall between `open_window` and `close_window`. For example, for the first row the expected mean value is 0.3911519 (the mean of the values in rows 2 to 10).

Thanks!!

Data:

# Weekly series for 2019: one row per ISO-ish week (5..19), each week
# starting on a Tuesday (2019-01-29 + multiples of 7 days).
Dates <- data.frame(
  Week = 5:19,
  Year = 2019L,  # recycled to all 15 rows
  value = c(0.254448507, 0.45173412229249, 0.385036779948187,
            0.263860826315789, 0.470172300321285, 0.429921606571429,
            0.35399815840708, 0.312191999393939, 0.485167061091703,
            0.368283872580645, 0.311850394166667, 0.553520780789474,
            0.54396702, 0.227984592352941, 0.914141983686275),
  min_date = seq(as.Date("2019-01-29"), by = "7 days", length.out = 15)
)

# One row per ID: an [open_window, close_window] date interval plus a
# Week number. Only the first row's window overlaps the 2019 series.
Data <- data.frame(
  open_window = as.Date(c("2019-02-02", "2009-03-04", "2004-03-06",
                          "2004-02-24", "2007-02-16", "2012-12-25",
                          "2018-03-31", "2008-03-03", "2008-02-01",
                          "2007-03-14")),
  close_window = as.Date(c("2019-04-06", "2009-05-06", "2004-05-08",
                           "2004-04-27", "2007-04-20", "2013-02-26",
                           "2018-06-02", "2008-05-05", "2008-04-04",
                           "2007-05-16")),
  ID = c("19-100", "09-414", "04-334", "04-076", "07-603",
         "13-706", "18-1197", "08-415", "08-922", "07-292"),
  Week = c(5, 9, 10, 8, 7, 52, 13, 9, 5, 11),  # double, as in the original
  stringsAsFactors = FALSE
)
Ian.T
  • 1,016
  • 1
  • 9
  • 19
  • 1
    I think `data.table`'s non-equi join would be the fastest. See this example https://stackoverflow.com/questions/64202974/sum-partial-datatable-as-column-for-another-datatable/ instead of `sum` use `mean`. – Ronak Shah Dec 02 '20 at 05:15
  • yes! thanks i think that could work. Just by curiosity is there a way to do it with the tidyverse? – Ian.T Dec 02 '20 at 05:19

3 Answers

3

Base R solution:

# Apply using row-number of df2:  
# For each row of df2, average every df1 value whose week start date
# falls inside [open_window, close_window], both ends inclusive.
# This matches the OP's expected 0.3911519 for the first row; the
# previous (min_date + 6) <= close_window condition required the whole
# week to fit inside the window, dropped the 2019-04-02 week, and
# returned 0.3940104 instead.
# mean(numeric(0)) is NaN, so rows with no overlapping week yield NaN.
# vapply (not sapply) guarantees a numeric vector even for 0-row input.
df2$avg <- vapply(seq_len(nrow(df2)), function(i) {
  keep <- df1$min_date >= df2$open_window[i] &
    df1$min_date <= df2$close_window[i]
  mean(df1$value[keep])
}, numeric(1))

Data:

# Weekly 2019 series (weeks 5..19), values rounded to 7 decimals,
# each min_date 7 days after the previous one starting 2019-01-29.
df1 <- data.frame(
  Week = 5:19,
  Year = 2019L,  # recycled to all 15 rows
  value = c(0.2544485, 0.4517341, 0.3850368, 0.2638608, 0.4701723,
            0.4299216, 0.3539982, 0.312192, 0.4851671, 0.3682839,
            0.3118504, 0.5535208, 0.543967, 0.2279846, 0.914142),
  min_date = seq(as.Date("2019-01-29"), by = "7 days", length.out = 15)
)

# df2 with the avg column as produced by the sapply() call above under
# its full-week containment rule ((min_date + 6) <= close_window):
# only the first window overlaps the 2019 data, so the rest are NaN.
df2 <- data.frame(
  open_window = as.Date(c("2019-02-02", "2009-03-04", "2004-03-06",
                          "2004-02-24", "2007-02-16", "2012-12-25",
                          "2018-03-31", "2008-03-03", "2008-02-01",
                          "2007-03-14")),
  close_window = as.Date(c("2019-04-06", "2009-05-06", "2004-05-08",
                           "2004-04-27", "2007-04-20", "2013-02-26",
                           "2018-06-02", "2008-05-05", "2008-04-04",
                           "2007-05-16")),
  ID = c("19-100", "09-414", "04-334", "04-076", "07-603",
         "13-706", "18-1197", "08-415", "08-922", "07-292"),
  Week = c(5L, 9L, 10L, 8L, 7L, 52L, 13L, 9L, 5L, 11L),
  avg = c(0.3940103625, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN),
  stringsAsFactors = FALSE
)
hello_friend
  • 5,682
  • 1
  • 11
  • 15
3

A tidyverse solution would be:

library(dplyr)

# Pair every weekly value with every ID (cartesian product), keep only
# the combinations whose week start date lies inside the ID's window
# (both ends inclusive), then average per ID. IDs whose window overlaps
# no week drop out of the result entirely.
Dates %>%
  select(value, min_date) %>%
  tidyr::crossing(Data %>% select(-Week)) %>%
  filter(between(min_date, open_window, close_window)) %>%
  group_by(ID) %>%
  summarise(value = mean(value))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
3

An additional solution, using `lubridate` intervals:

   # Same result without a rowwise list column: walk the two window
   # columns in parallel and average the weekly values falling inside
   # each interval. %within% is inclusive at both endpoints, and
   # mean(numeric(0), na.rm = TRUE) yields NaN for non-overlapping IDs.
   Data %>%
     as_tibble() %>%
     mutate(
       res = purrr::map2_dbl(
         open_window, close_window,
         ~ mean(Dates$value[Dates$min_date %within% interval(.x, .y)],
                na.rm = TRUE)
       )
     )

# A tibble: 10 x 5
   open_window close_window ID       Week     res
   <date>      <date>       <chr>   <dbl>   <dbl>
 1 2019-02-02  2019-04-06   19-100      5   0.391
 2 2009-03-04  2009-05-06   09-414      9 NaN    
 3 2004-03-06  2004-05-08   04-334     10 NaN    
 4 2004-02-24  2004-04-27   04-076      8 NaN    
 5 2007-02-16  2007-04-20   07-603      7 NaN    
 6 2012-12-25  2013-02-26   13-706     52 NaN    
 7 2018-03-31  2018-06-02   18-1197    13 NaN    
 8 2008-03-03  2008-05-05   08-415      9 NaN    
 9 2008-02-01  2008-04-04   08-922      5 NaN    
10 2007-03-14  2007-05-16   07-292     11 NaN  
Yuriy Saraykin
  • 8,390
  • 1
  • 7
  • 14