
In my problem I have to apply a function to a subset of each individual's time series, based on a set of dates extracted from the original data. So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of individual-date pairs (sample_events), ideally extracted from the same data.

With these, in my example I attempt to calculate an average over a subset of the time-series values exp, conditional on the individual and date in sample_events.

I did this in 2 different ways:

1: a simple but effective piece of code that gets the job done very quickly. I simply ask the user to input the data for a specific individual and to define a time lag and a window width (like a rolling average). The function exp_summary then outputs the requested average.

To repeat the operation for each row in sample_events, I nest the individual series by individual ID and then attach the sample of dates. Finally, I just run a loop that applies the function to each nested dataframe.

#Packages used throughout (data.table for exp_summary, dplyr/tidyr for nesting, runner for method 2)
library(data.table)
library(dplyr)
library(tidyr)
library(runner)

#Sample data
set.seed(111)
exp_series <- data.frame(
  id = as.character(rep(1:10000, each=2191)), 
  date = rep(seq(as.Date('2005-01-01'),
                 as.Date('2010-12-31'), by = 'day'),times=10000),
  exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)


sample_dates <- data.frame(
  Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))), 
  Event_date = sample(
    seq(as.Date('2005-01-01'),
        as.Date('2010-12-31'), by = 'day'),
    size =10000,replace = TRUE)
)



#This function takes:
# - a dataframe with dates and an exposure series (df)
# - an event_date
# - a lag value
# - a width of the window
#and outputs the average for the user-defined time window
exp_summary <- function(df, event_date, lag = 0, width = 0){
  df <- as.data.table(df)
  end <- as.character(as.Date(event_date) - lag)
  # Truncate the window if it would extend before the earliest available date
  start <- as.character(max(as.Date(end) - width, min(df$date)))
  return(mean(df[date %between% c(start, end)]$exp))
}
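
A minimal illustrative call (not in the original post; the date is arbitrary), assuming exp_series and exp_summary as defined above:

#Illustrative call only: average exposure for individual "1" over the 365-day
#window ending one day before 2006-06-01
exp_summary(exp_series[exp_series$id == "1", ], "2006-06-01", lag = 1, width = 365)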

#Nest dataframes
exp_series_nest <- exp_series %>% 
  group_by(id) %>% 
  nest()


#Merge with the sample events, keeping only the individuals that appear in sample_dates
full_data <- merge(exp_series_nest, sample_dates, by.x = "id", by.y = "Event_id", all.x = FALSE, all.y = TRUE)


#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id

#Loop over each id, whose data is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
  summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})

2: using the highly flexible package runner

With the same data, I need to specify the arguments properly. I have also opened an issue on the GitHub repository about speeding up this code with parallelization.

system.time(summaries2 <- sample_dates %>%
  group_by(Event_id) %>%
  mutate(
    mean = runner(
      x = exp_series[exp_series$id ==  Event_id[1],], 
      k = "365 days", 
      lag = "1 days",
      idx =exp_series$date[exp_series$id == Event_id[1]],
      at = Event_date,
      f = function(x) {mean(x$exp)},
      na_pad=FALSE
    )
  )    
)

They give the same results up to the second decimal place, but method 1 is much faster than method 2, and you can see the difference when you use very large datasets.

My question is: for method 1, how can I write the last loop more concisely within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns embedded in the same dataframe work together.
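
A rough sketch of one possible tidyverse-style rewrite of that loop (not from the original post; summaries1b is an invented name, and the sketch assumes the full_data object built above):

library(purrr)

#Sketch only: map2_dbl walks the nested 'data' list-column and the Event_date
#column in parallel, applying exp_summary to each pair
summaries1b <- full_data %>%
  mutate(mean = map2_dbl(data, Event_date,
                         ~ exp_summary(.x, .y, lag = 1, width = 365))) %>%
  select(id, mean)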

Also, if you have any other recommendations I am open to hearing them! I am here more out of curiosity than need, as method 1 already solves my problem acceptably.

jmarkov
    Your example is not very intuitive, some rows in `sample_dates` are duplicated. I can provide some points for method 1. 1. nest data is not necessary, try to replace it with group. 2. merge maybe replaced by left join, e.g. `exp_series[sample_dates, on = c(id = "Event_id")]` 3. `exp_summary` may be rewritten with `frollmean` by group . – Peace Wang Dec 18 '21 at 03:16
  • Thanks for your comment. IDs in `sample_dates` can be duplicated in my problem. The rest of the comments are not really useful, meaning, I already had the hunch that I could use different functions like `frollmean` or `slider` but I cannot figure out a way to make work together data in lists and single columns within those environments. – jmarkov Dec 18 '21 at 22:29
  • Also, thanks for point 2. I didn't know how to use join within `data.table`. – jmarkov Dec 18 '21 at 22:41
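
A rough, untested sketch of the frollmean idea from the comment above (roll_mean, lookup_date and summaries_froll are illustrative names): since the series is daily and complete, a calendar window of width 365 ending at Event_date minus a 1-day lag spans 366 rows, so a trailing row-based mean followed by a join on each window's end date should approximate exp_summary, except that windows reaching before the first observation give NA here instead of being truncated.

library(data.table)
setDT(exp_series)
setDT(sample_dates)

#Trailing mean over 366 daily rows (365-day width plus the end date itself), per id
exp_series[, roll_mean := frollmean(exp, n = 366, align = "right"), by = id]

#End of each event's window: Event_date minus a lag of 1 day
sample_dates[, lookup_date := Event_date - 1]

#Look up the precomputed mean at each window's end date
summaries_froll <- exp_series[sample_dates,
                              .(id = i.Event_id, mean = roll_mean),
                              on = .(id = Event_id, date = lookup_date)]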

1 Answer


With data.table, you could join exp_series with the desired date range in sample_dates and calculate the mean with by=.EACHI:

library(data.table)

setDT(exp_series)
setDT(sample_dates)


lag <- 1
width <- 365 
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]

# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
                                      ,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
                                      ,.(id,mean)]

Note that this returns the same results as summaries1 only for Event_id values that are not duplicated in sample_dates.

The results are different in case of duplicates, for instance Event_id==1002:

sample_dates[Event_id==1002]
   Event_id Event_date      begin        end
     <char>     <Date>     <Date>     <Date>
1:     1002 2010-08-17 2009-08-16 2010-08-16
2:     1002 2010-06-23 2009-06-22 2010-06-22

If you don't have duplicates in your real data, this shouldn't be a problem.
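
As a rough consistency check (not part of the original answer), the two result sets can be compared after sorting, since the row order within duplicated ids may differ between the loop and the join:

#Convert the loop result to data.table and compare after sorting by id and mean;
#any remaining differences would then reflect genuinely different values, not row order
setDT(summaries1)
all.equal(summaries1[order(id, mean)], summariesDT[order(id, mean)],
          check.attributes = FALSE)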

Waldi
  • Thanks! In my problem, IDs can be duplicated and your code still seems to work in that case, as it produces averages for each duplicated row independently. The interesting thing is that, even compared to your code, method 1 is still slightly faster. If you have enough RAM, try increasing the number of ids to 100k in `exp_series` and the number of rows to 100k in `sample_dates` – jmarkov Dec 18 '21 at 23:23
  • I didn't have enough RAM to test 100k rows with the loop. `data.table` ran in 40 seconds. Regarding identical results, I noticed that `summaries1-summariesDT` doesn't always return 0's, especially in the case of duplicates. Do you find the results OK in this case? – Waldi Dec 19 '21 at 09:52
  • Oh I see, could it be because `data.table` does not always keep the original order of the duplicated rows? I checked a couple of duplicated ids. – jmarkov Dec 19 '21 at 21:53