0

I've created a function which I am trying to apply to a dataset using pmap. The function I've created amends some columns in a dataset. I want the amendment that's applied to the two columns to carry over to the 2nd and subsequent iterations of pmap.

Reproducible example below:

library(tidyr)
library(dplyr)

set.seed(1982)

#create example dataset
dataset <- tibble(groupvar =  sample(c(1:3), 20, replace = TRUE),
                  a = sample(c(1:10), 20, replace = TRUE),
                  b = sample(c(1:10), 20, replace = TRUE),
                  c = sample(c(1:10), 20, replace = TRUE),
                  d = sample(c(1:10), 20, replace = TRUE)) %>%
  arrange(groupvar)


#function to sum 2 columns (col1 and col2), then adjust those columns such that the cumulative sum of the two columns
#within the group doesn't exceed the specified limit
shared_limits <- function(col1, col2, group, limit){
  dataset <- dataset
  dataset$group <- dataset[[group]]
  dataset$newcol <- dataset[[col1]] + dataset[[col2]]
  dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
  dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
  dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum)) 
  dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
  dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
  dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
                                                           pmin(dataset[[col1]], dataset$adjusted_sum))
  dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
  dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
  dataset
}

#apply function directly
new_dataset <- shared_limits("a", "b", "groupvar", 25)

#apply function using a separate parameters table and pmap_dfr
shared_limits_table <- tibble(col1 = c("a","b"),
                              col2 = c("c","d"),
                              group = "groupvar",
                              limit = c(25, 30))

dataset <- pmap_dfr(shared_limits_table, shared_limits)

In the example above the pmap function applies the shared limit to columns "a" and "c" and returns an adjusted dataset as the first element in the list. It then applies the shared limit to columns "b" and "d" and returns this as the second element in the list. However the adjustments that have been made to "a" and "c" are now lost.

Is there any way of storing the adjustments that are made to each column as we progress through each iteration of pmap?

andrew
  • 49
  • 5
  • You want to use purrr::reduce, similar to the use of accumulate in my answer at https://stackoverflow.com/questions/58959233/how-to-do-cumulative-filtering-with-purrraccumulate – pgcudahy Feb 24 '20 at 12:08

1 Answers1

1

You can iteratively apply a function to your dataset with reduce

First, I'd fix your function since dataset is undefined

shared_limits <- function(df, col1, col2, group, limit){
  dataset <- df
  dataset$group <- dataset[[group]]
  dataset$newcol <- dataset[[col1]] + dataset[[col2]]
  dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
  dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
  dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum)) 
  dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
  dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
  dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
                                                           pmin(dataset[[col1]], dataset$adjusted_sum))
  dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
  dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
  dataset
}

Then make a list of the arguments you want to pass to the function at each step

shared_limits_args_list <- list(
    list("a", "c", "groupvar", 25), 
    list("b", "d", "groupvar", 30))

Then call reduce, setting the dataset as your initial x with the .init parameter. At each iteration a sublist of arguments from shared_limits_args_list will be passed to the function as y. [[ is used to select the list elements for each position. The output dataframe from the function will become the new x for the next iteration, and the next sublist of shared_limits_args_list will be the next set of arguments. When all of the sublists of shared_limits_args_list have been used, the final dataframe is output.

dataset_combined <- 
    reduce(shared_limits_args_list, 
    function(x,y) shared_limits(df=x, y[[1]], y[[2]], y[[3]], y[[4]]),
    .init=dataset)
pgcudahy
  • 1,542
  • 13
  • 36