What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?

Question

I want to know the best way to iterate over rows of a data frame when the value of a variable at row n depends on the value of variable(s) at row n-1 and/or n-2. Ideally I would like to do this in a "tidyverse" way, perhaps with purrr::pmap().

For example, say I have this data frame:

library(dplyr)

x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

x
#> # A tibble: 10 x 4
#>        t     a      b         c
#>    <int> <dbl>  <dbl>     <dbl>
#>  1     1   100  0.750  0.900   
#>  2     2   110  0.898  0.657   
#>  3     3   120  0.731  0.000137
#>  4     4   130  0.208  0.696   
#>  5     5   140  0.670  0.882   
#>  6     6    NA NA     NA       
#>  7     7    NA NA     NA       
#>  8     8    NA NA     NA       
#>  9     9    NA NA     NA       
#> 10    10    NA NA     NA

I have known values up to time (t) = 5. Beyond that, I wish to project values, using the following formulae:

a = lag(a) * 1.1
b = a * lag(b)
c = b * lag(a, 2)

This code achieves the desired output, but it's a clunky, horrible for loop that scales poorly to larger datasets:

for(i in 1:nrow(x)) {
  x <- x %>%
    mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
           b = if_else(!is.na(b), b, a * lag(b, 1)),
           c = if_else(!is.na(c), c, b * lag(a, 2)))
}

x
#> # A tibble: 10 x 4
#>        t     a        b        c
#>    <int> <dbl>    <dbl>    <dbl>
#>  1     1  100  7.50e- 1 9.00e- 1
#>  2     2  110  8.98e- 1 6.57e- 1
#>  3     3  120  7.31e- 1 1.37e- 4
#>  4     4  130  2.08e- 1 6.96e- 1
#>  5     5  140  6.70e- 1 8.82e- 1
#>  6     6  154  1.03e+ 2 1.34e+ 4
#>  7     7  169. 1.75e+ 4 2.45e+ 6
#>  8     8  186. 3.26e+ 6 5.02e+ 8
#>  9     9  205. 6.68e+ 8 1.13e+11
#> 10    10  225. 1.51e+11 2.80e+13

Does the solution need to work if the NAs are patchy ie dispersed throughout the data frame, or are they always at the bottom of a column? If a column can be relied on to "stay" NA once it has started being an NA, that makes some things simpler. — Peter Ellis, Oct 05 '20 at 05:48

score 5 · Answer 1 · answered Oct 05 '20 at 06:23

I think that for this sort of intrinsically iterative process it is genuinely hard to beat a for loop. The method proposed by @Shree depends on the NAs being continuous and starting in a known spot.

Here is my mild improvement on your loop, which I think is more readable and about 2.5 times the speed and will probably scale up better than your approach which combines vectorized operations with the loop. By moving out of the tidyverse altogether and embracing a rowwise loop that really works on each row one at a time, we get some efficiencies on both counts:

method_peter <- function(x){
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

There's doubtless more efficiencies possible, and of course this is an ideal candidate to rewrite it in C++ :).

This is about twice as fast as your method as seen by this:

method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

set.seed(123)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

stopifnot(identical(method_matt(x), method_peter(x)))

library(microbenchmark)
microbenchmark(
  method_matt(x),
  method_peter(x)
)

which returns:

Unit: milliseconds
            expr     min       lq     mean   median      uq     max neval
  method_matt(x) 24.1975 25.50925 30.64438 26.33310 31.8681 74.5093   100
 method_peter(x) 10.0005 10.56050 13.33751 11.06495 13.5913 42.0568   100

@Shree's method is much faster again and is ideal for the example data, but I'm not sure it is flexible enough to work in all your use cases.

I would like to see a tidyverse solution if there is one.

Ian · Answer 2 · 2022-04-06T11:29:02.917

Edit: Added tidyverse approach

Here's a readable and flexible tidyverse approach. The downside is that it is very slow.

accumutate <- function(df, ...){
  df %>% group_by(row_number()) %>%
    nest() %>%
    pull(data) %>%
    accumulate(function(x,y){bind_rows(x,y) %>% mutate(!!!enquos(...)) }) %>%
    .[[length(.)]]
}

x %>%
  accumutate(a = ifelse(is.na(a), 1.1 * lag(a,1), a)) %>%
  accumutate(b = ifelse(is.na(b), a * lag(b), b)) %>%
  accumutate(c = ifelse(is.na(c),b * lag(a, 2), c))

#> # A tibble: 10 x 4
#>        t     a        b        c
#>    <int> <dbl>    <dbl>    <dbl>
#>  1     1  100  2.88e- 1 4.56e- 2
#>  2     2  110  7.88e- 1 5.28e- 1
#>  3     3  120  4.09e- 1 8.92e- 1
#>  4     4  130  8.83e- 1 5.51e- 1
#>  5     5  140  9.40e- 1 4.57e- 1
#>  6     6  154  1.45e+ 2 1.88e+ 4
#>  7     7  169. 2.45e+ 4 3.43e+ 6
#>  8     8  186. 4.57e+ 6 7.04e+ 8
#>  9     9  205. 9.37e+ 8 1.59e+11
#> 10    10  225. 2.11e+11 3.94e+13

^{Created on 2020-10-07 by the reprex package (v0.3.0)}

Here's another approach that you might find interesting. It's not concise or especially readable, but it's tidyverse (or at least functionally) inspired. And it performs fairly well.

It uses a semigroup pattern, converting the mutate expressions into binary functions, creating corresponding lists and then using accumulate.

library(tidyverse)
library(dplyr)
library(microbenchmark)
options(width =100)

set.seed(123)

# Create the data frame
x <- tibble(t = c(1:100),
            a = c(seq(100, 140, 10), rep(NA_real_,100- 5)),
            b = c(runif(5), rep(NA_real_, 100-5)),
            c = c(runif(5), rep(NA_real_, 100-5)))

a_mappend <- function(a1, a2) {
  ifelse(is.na(a2), a1 * 1.1, a2)
}

b_mappend <- function(ab1, ab2) {
    list(a = ab2$a, b =  ifelse(is.na(ab2$b), ab2$a * ab1$b,ab2$b))
}

c_mappend <- function(abc12, abc23) {
  list(abc1 = list(a = abc12$abc2$a, b = abc12$abc2$b, c = abc12$abc2$c),
       abc2 = list(a = abc23$abc2$a, b = abc23$abc2$b, c = ifelse(is.na(abc23$abc2$c),abc12$abc1$a * abc23$abc2$b,abc23$abc2$c)))
}

method_ian <- function(x) {
  x %>%
    mutate(a = accumulate(a, a_mappend)) %>%
    mutate(b = list(a, b) %>% 
                  pmap(~ list(a = .x, b = .y)) %>% 
                  accumulate(b_mappend) %>% map_dbl(~ .x$b)) %>%
    mutate(c = list(a, b, c, c(a[-1], NA), c(b[-1], NA), c(c[-1], NA)) %>%
                  pmap(~ list(abc1 = list(a = ..1, b = ..2, c = ..3),
                              abc2 = list(a = ..4, b = ..5, c = ..6))) %>% 
                  accumulate(c_mappend) %>% map_dbl(~ .x$abc1$c))
}


method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

method_peter <- function(x){
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

stopifnot(identical(method_matt(x), method_ian(x)))

microbenchmark( method_matt(x), method_peter(x), method_ian(x))
#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval
#>   method_matt(x) 324.90086 330.93192 337.46518 334.55447 338.38461 426.30457   100
#>  method_peter(x) 208.27498 211.60526 213.59438 212.66088 214.36421 242.59854   100
#>    method_ian(x)  13.06774  13.43105  14.30003  13.86428  14.32263  19.54843   100

^{Created on 2020-10-06 by the reprex package (v0.3.0)}

This is very ingenious. It scores very well for performance, but not so good for readability! Me, I'd rather re-write mine in Rcpp I think given the choice, if I needed to extract performance. — Peter Ellis, Oct 06 '20 at 12:38
@PeterEllis It might have some minor issues regarding readability... ;) Agreed — Rcpp would be the best combination of performance and clarity. FWIW, I've added another alternative above that you might use if performance isn't an issue at all. — Ian, Oct 07 '20 at 01:28

score 2 · Accepted Answer · answered Sep 01 '19 at 03:53

I don't think there's any simple way in tidyverse to do calculations with row-dependencies. Something with Reduce or gather + spread could be possible but I don't expect them to score poits on readability.

Anyways, on the bright side, your calculations are vectorizable using dplyr and zoo packages -

x %>% 
  mutate(
    a = ifelse(is.na(a), na.locf(a) * 1.1^(t-5), a),
    b = ifelse(is.na(b), na.locf(b) * c(rep(1, 5), cumprod(a[6:n()])), b),
    c = ifelse(is.na(c), b * lag(a, 2), c)
  )

 # A tibble: 10 x 4
 t     a        b        c
 <int> <dbl>    <dbl>    <dbl>
 1     1  100  1.85e- 1 9.43e- 1
 2     2  110  7.02e- 1 1.29e- 1
 3     3  120  5.73e- 1 8.33e- 1
 4     4  130  1.68e- 1 4.68e- 1
 5     5  140  9.44e- 1 5.50e- 1
 6     6  154  1.45e+ 2 1.89e+ 4
 7     7  169. 2.46e+ 4 3.45e+ 6
 8     8  186. 4.59e+ 6 7.07e+ 8
 9     9  205. 9.40e+ 8 1.59e+11
10    10  225. 2.12e+11 3.95e+13

Data -

set.seed(2)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

This method depends on knowing that the NAs all start after the 5th observation, and being continuous with eachother (ie not stopping and starting). — Peter Ellis, Oct 05 '20 at 06:06

What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?

3 Answers3