One possible solution using dplyr
and purrr
:
Libraries
library(dplyr)
library(tidyr)
library(purrr) # used for pmap() in dbl_fill()
Data
NB you should avoid using <-
when naming list elements, tibble columns and the like.
df <- tibble(
factor = seq(0.7,1.3, 0.1),
items = c(7, 8, 9, 10, 11, 12, 13),
cost = c(NA, NA, 70, 80, 90, NA, NA),
elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)
df <- fill(df, elasticity, .direction = 'updown')
Create dbl_fill()
I realised whilst mutating df
that I needed to recursively lag()
or lead()
columns according to the 'depth' of missing values. I assumed your data was example data, so figured it would be best to make a generic function that lags/leads a numeric vector up to a maximum acceptable depth (2 is enough for this example).
dbl_fill <- function(x, lag_or_lead = c("lag", "lead"), max_fill = 2){
lag_or_lead <- match.arg(lag_or_lead)
if(lag_or_lead == "lag") fill_function <- lag
else fill_function <- lead
n_list <- as.list(1:max_fill)
fill_list <- lapply(n_list, function(y) fill_function(x, y))
vector_out <- pmap_dbl(fill_list, coalesce)
return(vector_out)
}
Perform various mutations
We can perform prolific mutations on df
, then just select()
away the columns we don't want to keep.
I'm lagging/leading df$items
here because I don't want to assume values in that column are incremented by one between rows. But if they are, it would be sufficient to do items - 1
as opposed to lag(items)
. For the example, these are equivalent.
df <- mutate(df,
## calculate new costs from lagged values
pc_change_items = (items - lag(items)) / lag(items),
lagged_cost = dbl_fill(cost, "lag", 2),
pc_change_cost = pc_change_items * elasticity,
lag_cost = lagged_cost * (1 + pc_change_cost),
## calculate new costs from lead'ed values
pc_change_items = (lead(items) - items) / items,
leaded_cost = dbl_fill(cost, "lead", 2),
pc_change_cost = pc_change_items * elasticity,
lead_cost = leaded_cost * (1 - pc_change_cost)
)
Drop intermediate columns
df <- select(df, factor, items, cost, elasticity, lag_cost, lead_cost)
df
#> # A tibble: 7 × 6
#> factor items cost elasticity lag_cost lead_cost
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.7 7 NA 0.5 NA 65
#> 2 0.8 8 NA 0.5 NA 65.6
#> 3 0.9 9 70 0.5 NA 75.6
#> 4 1 10 80 0.6 74.7 84.6
#> 5 1.1 11 90 0.7 85.6 NA
#> 6 1.2 12 NA 0.7 95.7 NA
#> 7 1.3 13 NA 0.7 95.2 NA
Coalesce costs
Note that here I'm giving priority to lag_cost
over lead_cost
, but that's completely arbitrary and you may want to justify it. It might be more balanced to get the mean of the two where both are available, but that's out of scope for this answer.
mutate(df, cost = coalesce(cost, lag_cost, lead_cost)) |>
select(-lag_cost, -lead_cost)
#> # A tibble: 7 × 4
#> factor items cost elasticity
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.7 7 65 0.5
#> 2 0.8 8 65.6 0.5
#> 3 0.9 9 70 0.5
#> 4 1 10 80 0.6
#> 5 1.1 11 90 0.7
#> 6 1.2 12 95.7 0.7
#> 7 1.3 13 95.2 0.7
Created on 2022-10-25 with reprex v2.0.2