How to fill in the missing values in a dataframe in R, where the logic for filling in missing values is percentage change estimation?

Question

I am working with cost data for a retailer, where I am making some predictions using gam that look as follows (sample data, self-generated). The GAM fits values in the middle, but has some NAs at the extremes. The elasticity is calculated as a percentage change in cost over percentage change in items.

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

An easy estimate for the elasticises is to extend the last known value up and down.

df %>%
    fill(elasticity, .direction = 'updown') ->
    df

factor	items	cost	elasticity
0.7	7	NA	0.5
0.8	8	NA	0.5
0.9	9	70	0.5
1.0	10	80	0.6
1.1	11	90	0.7
1.2	12	NA	0.7
1.3	13	NA	0.7

I would like to calculate the cost, having estimated the elasticity of cost. For example, for a factor of 1.2, the items are 12 and elasticity is 0.7. The percentage change in items is (12-11)/11 = 9.09%, so the percentage change in cost should be 0.7 * 9.09% = 6.36%. Since cost for factor of 1.1 is 90, the cost for factor of 1.2 is 95.72. And the same propagated both down and up.

I cannot figure a way of doing this. Can someone suggest how this can be done in R and preferably in dplyr?

I think its representative of the data. Can you point out what logic isn't sound? — Sarthak Nigam, Oct 25 '22 at 15:22
Ah, yes, true. The values should have been closer to the actual value of elasticity given the data. I'll change the values — Sarthak Nigam, Oct 25 '22 at 15:34

Captain Hat · Answer 1 · 2022-10-26T08:01:56.693

One possible solution using dplyr and purrr:

Libraries

library(dplyr)
library(tidyr)
library(purrr) # used for pmap() in dbl_fill()

Data

NB you should avoid using <- when naming list elements, tibble columns and the like.

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

df <- fill(df, elasticity, .direction = 'updown')

Create `dbl_fill()`

I realised whilst mutating df that I needed to recursively lag() or lead() columns according to the 'depth' of missing values. I assumed your data was example data, so figured it would be best to make a generic function that lags/leads a numeric vector up to a maximum acceptable depth (2 is enough for this example).

dbl_fill <- function(x, lag_or_lead = c("lag", "lead"), max_fill = 2){

  lag_or_lead <- match.arg(lag_or_lead)
  if(lag_or_lead == "lag") fill_function <- lag
  else fill_function <- lead
  
  n_list <- as.list(1:max_fill)
  
  fill_list <- lapply(n_list, function(y) fill_function(x, y))
 
  vector_out <- pmap_dbl(fill_list, coalesce)
  
  return(vector_out)
}

Perform various mutations

We can perform prolific mutations on df, then just select() away the columns we don't want to keep.

I'm lagging/leading df$items here because I don't want to assume values in that column are incremented by one between rows. But if they are, it would be sufficient to do items - 1 as opposed to lag(items). For the example, these are equivalent.

df <- mutate(df,
             ## calculate new costs from lagged values
             pc_change_items = (items - lag(items)) / lag(items),
             lagged_cost     = dbl_fill(cost, "lag", 2),
             pc_change_cost  = pc_change_items * elasticity,
             lag_cost        = lagged_cost * (1 + pc_change_cost),
             
             ## calculate new costs from lead'ed values
             pc_change_items = (lead(items) - items) / items,
             leaded_cost     = dbl_fill(cost, "lead", 2),
             pc_change_cost  = pc_change_items * elasticity,
             lead_cost       = leaded_cost * (1 - pc_change_cost)
             )

Drop intermediate columns

df <- select(df, factor, items, cost, elasticity, lag_cost, lead_cost)
df
#> # A tibble: 7 × 6
#>   factor items  cost elasticity lag_cost lead_cost
#>    <dbl> <dbl> <dbl>      <dbl>    <dbl>     <dbl>
#> 1    0.7     7    NA        0.5     NA        65  
#> 2    0.8     8    NA        0.5     NA        65.6
#> 3    0.9     9    70        0.5     NA        75.6
#> 4    1      10    80        0.6     74.7      84.6
#> 5    1.1    11    90        0.7     85.6      NA  
#> 6    1.2    12    NA        0.7     95.7      NA  
#> 7    1.3    13    NA        0.7     95.2      NA

Coalesce costs

Note that here I'm giving priority to lag_cost over lead_cost, but that's completely arbitrary and you may want to justify it. It might be more balanced to get the mean of the two where both are available, but that's out of scope for this answer.

mutate(df, cost = coalesce(cost, lag_cost, lead_cost)) |> 
  select(-lag_cost, -lead_cost)
#> # A tibble: 7 × 4
#>   factor items  cost elasticity
#>    <dbl> <dbl> <dbl>      <dbl>
#> 1    0.7     7  65          0.5
#> 2    0.8     8  65.6        0.5
#> 3    0.9     9  70          0.5
#> 4    1      10  80          0.6
#> 5    1.1    11  90          0.7
#> 6    1.2    12  95.7        0.7
#> 7    1.3    13  95.2        0.7

^{Created on 2022-10-25 with reprex v2.0.2}

Ric · Answer 2 · 2022-10-25T16:15:12.603

To do the rolling fill in one pass with dplyr:

library(dplyr)

df <- tibble(
  factor = seq(0.7,1.3, 0.1),
  items = c(7, 8, 9, 10, 11, 12, 13),
  cost = c(NA, NA, 70, 80, 90, NA, NA),
  elasticity = c(NA, NA, 0.5, 0.6, 0.7, NA, NA)
)

df %>%
    tidyr::fill(elasticity, .direction = 'updown') ->
    df

df <- df %>% 
  group_by(na_grp = with(
    rle(is.na(cost) | is.na(lead(cost))), 
    rep(seq_along(lengths),lengths))) %>% 
  mutate(cost = if_else(is.na(cost), head(cost,1) * cumprod(
    if_else(is.na(cost), (items / lag(items)-1) * elasticity + 1, 1)),cost)) %>% 
  group_by(na_grp = with(
    rle(is.na(cost) | is.na(lag(cost))), 
    rep(seq_along(lengths),lengths))) %>% 
  mutate(cost = if_else(is.na(cost), tail(cost,1) / rev(cumprod(rev(
     if_else(is.na(cost), (lead(items) / items - 1) * lead(elasticity) + 1, 1)))),cost)) %>% 
  ungroup() %>% 
  select(-na_grp)

df
#> # A tibble: 7 × 4
#>   factor items  cost elasticity
#>    <dbl> <dbl> <dbl>      <dbl>
#> 1    0.7     7  61.5        0.5
#> 2    0.8     8  65.9        0.5
#> 3    0.9     9  70          0.5
#> 4    1      10  80          0.6
#> 5    1.1    11  90          0.7
#> 6    1.2    12  95.7        0.7
#> 7    1.3    13 101.         0.7

Hey Ric, I cannot reproduce the output when I try the same on my computer. The cost comes out to be partially filled — Sarthak Nigam, Oct 26 '22 at 04:33
I can run the full example in a fresh R session with no modification. if you `print(df$cost)` can you see full values? what do you mean with partially filled? — Ric, Oct 26 '22 at 04:42
I mean that cost still has many NAs. In the example above, of the 4 NAs, only 1 was filled. I had done it on a fresh R session. Is there another library to import? — Sarthak Nigam, Oct 26 '22 at 05:49

factor	items	cost	elasticity
0.7	7	NA	0.5
0.8	8	NA	0.5
0.9	9	70	0.5
1.0	10	80	0.6
1.1	11	90	0.7
1.2	12	NA	0.7
1.3	13	NA	0.7

factor	items	cost	elasticity
0.7	7	NA	0.5
0.8	8	NA	0.5
0.9	9	70	0.5
1.0	10	80	0.6
1.1	11	90	0.7
1.2	12	NA	0.7
1.3	13	NA	0.7