I have a for loop that I would like to run by group. It should run through a set of data, create a time series for most rows, and then output a forecast for each row (based on that time point and the ones preceding it) within the group. The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually, as that would take hours, and surely there is a better way.

Allow me to explain in more detail.

I have a large dataset (1.6M rows). Each row has a year, country A, country B, and a number of measures concerning the relationship between the two.

So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to write the necessary forecast data to a new variable in the dataset. I'd like that for loop to run over every (country A, country B) grouping with more than 3 entries.

The data:

Here I will replicate a small slice of the data, and will include a missing value for realism.

set.seed(2000)
df <- data.frame(year = rep(c(1946:1970), length.out = 50),
                 ccode1 = rep(c("2"), length.out = 50),
                 ccode2 = rep(c("20", "31"), each = 25),
                 kappavv = rnorm(50, mean = 0, sd = 0.25),
                 output = NA)
df$kappavv[12] <- NA

What I've done:

NOTE: I start forecasting at the third data point of each group, and each forecast is based on all time points up to that row.

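library(imputeTS)  # na_interpolation()
library(forecast)  # holt()

for (i in 3:nrow(df)) {
  # Series from the first year up to and including row i
  dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
  dat_ts_corr <- na_interpolation(dat_ts)
  trialseries <- holt(dat_ts_corr, h = 1)
  df$output[i] <- trialseries$mean
}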

This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2, provided the rows are sorted in ascending order of year.

What isn't working:

I am having serious trouble getting my head around applying this for loop to each ccode2 group. Some of my data is uneven: groups differ in size, have different start and end points, and contain missing data.
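To make that unevenness concrete, here is a minimal base R sketch that summarises each (ccode1, ccode2) pair: number of rows, first and last year, missing values, and whether any year is absent between the first and last. The `toy` data frame and the `summary_tbl` column names are made up for illustration and are not taken from the real dataset.

```r
# Toy data in the same shape as the example above; the values are made up.
toy <- data.frame(
  year    = c(1946:1950, 1948:1949, 1950, 1952, 1953),
  ccode1  = "2",
  ccode2  = rep(c("20", "31", "42"), times = c(5, 2, 3)),
  kappavv = c(0.1, NA, 0.3, 0.2, 0.4, 0.0, 0.1, 0.2, NA, 0.5)
)

# One data frame per (ccode1, ccode2) pair; drop = TRUE skips empty combinations.
groups <- split(toy, list(toy$ccode1, toy$ccode2), drop = TRUE)

# One summary row per pair.
summary_tbl <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    ccode1     = g$ccode1[1],
    ccode2     = g$ccode2[1],
    n_rows     = nrow(g),
    first_year = min(g$year),
    last_year  = max(g$year),
    n_missing  = sum(is.na(g$kappavv)),
    # TRUE when some year between first and last has no row at all
    has_gap    = (max(g$year) - min(g$year) + 1) > length(unique(g$year))
  )
}))

print(summary_tbl)
```

A table like this shows which pairs are too short to forecast and which have gaps in their years. The gaps matter because `ts()` with `frequency = 1` treats the observations as consecutive years.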

I have tried expressing the loop as a function, using group_by() with piping, and using various apply() functions.

Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.

1 Answer

You can put the for loop code in a function that takes the data for a single group.

library(dplyr)
library(purrr)

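apply_func <- function(df) {
  # Groups with fewer than 3 rows are returned as-is; 3:nrow(df) would count backwards
  if (nrow(df) < 3) return(df)
  for (i in 3:nrow(df)) {
    # Series from the group's first year up to and including row i
    dat_ts <- ts(df$kappavv, start = c(min(df$year), 1),
                 end = c(df$year[i], 1), frequency = 1)
    # Fill missing values by interpolation before fitting
    dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
    # Holt's linear trend method, one step ahead
    trialseries <- forecast::holt(dat_ts_corr, h = 1)
    df$output[i] <- as.numeric(trialseries$mean)
  }
  return(df)
}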

Then split the data by ccode2 and apply apply_func to each group. This assumes the rows within each group are already sorted by year.

df %>% group_split(ccode2) %>% map_df(apply_func)

#    year ccode1 ccode2 kappavv  output
#   <int> <chr>  <chr>    <dbl>   <dbl>
# 1  1946 2      20     -0.213  NA     
# 2  1947 2      20     -0.0882 NA     
# 3  1948 2      20      0.223   0.286 
# 4  1949 2      20      0.435   0.413 
# 5  1950 2      20      0.229   0.538 
# 6  1951 2      20     -0.294   0.477 
# 7  1952 2      20     -0.485  -0.675 
# 8  1953 2      20      0.524   0.405 
# 9  1954 2      20      0.0564  0.0418
#10  1955 2      20      0.294   0.161 
# … with 40 more rows
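
The split above uses ccode2 only, which is enough here because ccode1 is constant in the sample. For the full data, a base R sketch of the same split-apply-combine idea might group on both codes, sort each group by year, and return short groups unchanged, since a loop written as `3:nrow(df)` counts backwards when a group has fewer than three rows. Everything named below (`naive_trend`, `forecast_group`, `forecast_by_pair`, `toy`) is hypothetical. `naive_trend` is only a stand-in forecaster so the sketch runs without extra packages, and it is not Holt's method.

```r
# Hypothetical stand-in for the Holt forecast so the sketch runs in base R.
# It drops NAs and extrapolates the last two observed values by one step.
naive_trend <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 2) return(NA_real_)
  x[n] + (x[n] - x[n - 1])
}

# Expanding-window forecast for one group, starting at row `min_rows`.
forecast_group <- function(g, forecaster = naive_trend, min_rows = 3) {
  g <- g[order(g$year), ]              # the loop assumes ascending years
  g$output <- NA_real_
  if (nrow(g) < min_rows) return(g)    # too short: leave output as NA
  for (i in min_rows:nrow(g)) {
    # Forecast from rows 1..i and store it in row i
    g$output[i] <- forecaster(g$kappavv[seq_len(i)])
  }
  g
}

# Split by (ccode1, ccode2), forecast each group, and stack the results.
forecast_by_pair <- function(df, ...) {
  groups <- split(df, list(df$ccode1, df$ccode2), drop = TRUE)
  result <- do.call(rbind, lapply(groups, forecast_group, ...))
  rownames(result) <- NULL
  result
}

# Made-up data: one group of five rows and one group of only two rows.
toy <- data.frame(
  year    = c(1946:1950, 1946:1947),
  ccode1  = "2",
  ccode2  = rep(c("20", "31"), times = c(5, 2)),
  kappavv = c(1, 2, 4, NA, 7, 3, 5)
)

res <- forecast_by_pair(toy)
print(res)
```

To use Holt's method instead, a forecaster along the lines of `function(x) as.numeric(forecast::holt(imputeTS::na_interpolation(ts(x)), h = 1)$mean)` could be passed as the `forecaster` argument. That variant is not run here, and it assumes each window has enough non-missing values for the interpolation and the model fit.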
Ronak Shah