7

I have a data frame built from two vectors of length 5, plus a scalar variable:

x <- seq(1:5)
y <- rep(0,5)
df <- data.frame(x, y)
z <- 10

I need to loop through the data frame and update y based on a condition related to x using z, and I need to update z at every iteration. Using a for loop, I would do this:

for (i in seq(2,nrow(df))){
  if(df$x[i] %% 2 == 0){
    df$y[i] <- df$y[i-1] + z
    z <- z - df$x[i]
  } else{
    df$y[i] <- df$y[i-1]
  }
}

Using data frames is slow, and having to access the ith item with df$x[i] is not efficient, but I am unsure how to vectorize this since both y and z change at each iteration.

Does anyone have a recommendation on the best way to iterate this? I was looking to avoid data frames completely and just use vectors to simplify the lookups, or to use something from the tidyverse with tibbles and the purrr package, but nothing seemed easy to implement. Thanks!

George
  • It would help to see your expected output: the final values of y and z. – neilfws Feb 08 '18 at 22:59
  • @neilfws Just call `df` after the loop. – jay.sf Feb 08 '18 at 23:01
  • 1
    @jaySf you're assuming the code as shown is error-free :) it helps to know what the questioner thinks the output should be – neilfws Feb 08 '18 at 23:02
  • @neil This is highly simplified. y starts at 0 and ends at 18. It is only increased when i = 2 or 4, so 10 is added first and then 8. In the real df, z is one of several functions depending on x, and it takes a starting amount that changes over each iteration. The goal of this post is to simplify this type of loop or make it more efficient. I'm still new to R and functional programming. – George Feb 09 '18 at 01:33
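To make the expected output concrete (per the comments: y ends at 18), here is the question's loop run on the sample data:

```r
# Sample data from the question
x <- seq(1:5)
y <- rep(0, 5)
df <- data.frame(x, y)
z <- 10

# Original loop: y accumulates the current z on even x; z shrinks by x each time
for (i in seq(2, nrow(df))) {
  if (df$x[i] %% 2 == 0) {
    df$y[i] <- df$y[i - 1] + z
    z <- z - df$x[i]
  } else {
    df$y[i] <- df$y[i - 1]
  }
}

df$y  # 0 10 10 18 18
z     # 4
```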

4 Answers

5

You can use the `sapply` function, with global assignment (`<<-`) to carry `y` and `z` across iterations:

y=0
z=10
sapply(df$x,function(x)ifelse(x%%2==0,{y<<-y+z;z<<-z-x;y},y<<-y))
[1]  0 10 10 18 18
Onyambu
3

Here's a vectorized version

vec_fun <- function(x, z) {
    L <- length(x)

    vec_z <- rep(0, L)
    I <- seq(2, L, by=2)
    vec_z[I] <- head(z-c(0, cumsum(I)), length(I))

    cumsum(vec_z)
}
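To see why this works for the sample input (x = 1:5, z = 10): the function assumes x = 1:L, so even x values sit at even positions. Those positions receive the successive z values, and cumsum carries each y forward:

```r
z <- 10
L <- 5
I <- seq(2, L, by = 2)                          # even positions: 2, 4
z_used <- head(z - c(0, cumsum(I)), length(I))  # z at those positions: 10, 8
vec_z <- rep(0, L)
vec_z[I] <- z_used
cumsum(vec_z)  # 0 10 10 18 18
```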

The alternative versions - sapply & tidyverse

sapply_fun <- function(x, z) {
    y <- 0
    sapply(x, function(x) ifelse(x %% 2 == 0, {y <<- y + z; z <<- z - x; y}, y <<- y))
}

library(tidyverse)
tidy_fun <- function(df) {
    df %>% 
      filter(x %% 2 != 0) %>%
      mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
      right_join(df, by = c("x", "y")) %>%
      mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
      mutate(y = cumsum(z)) %>%
      select(-z) %>%
      pluck("y")
}

Your data

df <- data.frame(x=1:5, y=0)
z <- 10

Let's make sure they all return the same result

identical(vec_fun(df$x, z), sapply_fun(df$x, z)) &&
  identical(vec_fun(df$x, z), tidy_fun(df))
# TRUE

Benchmark with small dataset - sapply_fun appears to be slightly faster

library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")

# Unit: relative
                # expr        min         lq       mean     median         uq      max neval
    # vec_fun(df$x, z)   1.349053   1.316664   1.256691   1.359864   1.348181 1.146733   100
 # sapply_fun(df$x, z)   1.000000   1.000000   1.000000   1.000000   1.000000 1.000000   100
        # tidy_fun(df) 411.409355 378.459005 168.689084 301.029545 270.519170 4.244833   100

Now with larger data.frame

df <- data.frame(x=1:1000, y=0)
z <- 10000

Same result - yes

identical(vec_fun(df$x, z), sapply_fun(df$x, z)) &&
  identical(vec_fun(df$x, z), tidy_fun(df))
# TRUE

Benchmark with larger dataset - now it's obvious vec_fun is faster

library(microbenchmark)
microbenchmark(vec_fun(df$x, z), sapply_fun(df$x, z), tidy_fun(df), times=100L, unit="relative")

# Unit: relative
                # expr       min        lq      mean    median        uq     max neval
    # vec_fun(df$x, z)   1.00000   1.00000   1.00000   1.00000   1.00000   1.000   100
 # sapply_fun(df$x, z)  42.69696  37.00708  32.19552  35.19225  27.82914  27.285   100
        # tidy_fun(df) 259.87893 228.06417 201.43230 218.92552 172.45386 380.484   100
CPak
  • 1
    It seems like `tidy_fun` is from my post. I would suggest that do `pluck("y")` at the end of the pipe rather than assign the whole thing to `df2` and then access the `y` column if you want the function to only return a vector. – www Feb 09 '18 at 04:56
  • But good solution. My suggestion is only want to make sure you conduct the benchmarking in the right way. Your `vec_fun` would still be the fastest. – www Feb 09 '18 at 05:05
  • @www I don't take any criticism from your comment. I'll update my post with `pluck` – CPak Feb 09 '18 at 05:06
  • But still, your `vec_fun` is the fastest. Thanks for providing such nice solution. – www Feb 09 '18 at 05:13
2

Since your data contains only numbers, you could use a matrix rather than a data frame, which is slightly faster.

mx <- matrix(c(x, y), ncol = 2, dimnames = list(1:length(x), c("x", "y")))

for (i in seq(2, nrow(mx))){
  if(mx[i, 1] %% 2 == 0){
    mx[i, 2] <- mx[i-1, 2] + z
    z <- z - mx[i, 1]
    } else {
      mx[i, 2]  <- mx[i-1, 2] 
    }
  }

mx
# x  y
# 1 1  0
# 2 2 10
# 3 3 10
# 4 4 18
# 5 5 18
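The timings below came from microbenchmark; a sketch of how such a comparison might be set up (the wrapper names mx_run and df_run are hypothetical, not from the answer):

```r
x <- 1:5
y <- rep(0, 5)

# hypothetical wrapper around the matrix loop above
mx_run <- function() {
  mx <- matrix(c(x, y), ncol = 2, dimnames = list(1:length(x), c("x", "y")))
  z <- 10
  for (i in seq(2, nrow(mx))) {
    if (mx[i, 1] %% 2 == 0) {
      mx[i, 2] <- mx[i - 1, 2] + z
      z <- z - mx[i, 1]
    } else {
      mx[i, 2] <- mx[i - 1, 2]
    }
  }
  unname(mx[, 2])
}

# hypothetical wrapper around the question's data frame loop
df_run <- function() {
  df <- data.frame(x, y)
  z <- 10
  for (i in seq(2, nrow(df))) {
    if (df$x[i] %% 2 == 0) {
      df$y[i] <- df$y[i - 1] + z
      z <- z - df$x[i]
    } else {
      df$y[i] <- df$y[i - 1]
    }
  }
  df$y
}

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(mx = mx_run(), df = df_run(), times = 1000L))
}
```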

microbenchmark() results:

# Unit: milliseconds
#  expr       min        lq     mean    median       uq       max neval
#    mx  8.675346  9.542153 10.71271  9.925953 11.02796  89.35088  1000
#    df 10.363204 11.249255 12.85973 11.785933 13.59802 106.99920  1000
jay.sf
  • Should your matrix assignment be `nrow(df)`? – Kevin Arseneau Feb 08 '18 at 22:57
  • @KevinArseneau I don't think so, df isn't needed anymore, is it? – jay.sf Feb 08 '18 at 22:59
  • 1
    Not once you create your matrix, but if you try your code in a fresh session you will see you don't have the object `mx` to use in `nrow`. Alternatively, use `length` for `x` or `y`, then the `df` is completely irrelevant – Kevin Arseneau Feb 08 '18 at 23:00
  • @KevinArseneau Indeed, in this case length of one spanning vector should be sufficient. – jay.sf Feb 08 '18 at 23:03
  • 1
    thanks, I was looking to simplify the datatypes since I thought a matrix would be faster. This does help, but still uses the same basic loop strategy – George Feb 09 '18 at 01:39
2

It would be great if we could vectorize the operation on the data frame. My strategy is to calculate the z value for each row and then use cumsum to calculate the y values. The accumulate function from the purrr package calculates the running z values, right_join from the dplyr package merges them back into the original rows, and lag with ifelse shifts the z values down one row and fills the gaps with 0.
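For intuition on the accumulate step, a base-R sketch (purrr::accumulate(v, `-`) is equivalent to Reduce(`-`, v, accumulate = TRUE)): the odd-x rows of the sample data produce the running z values 10, 8, 4:

```r
z <- 10
x_odd <- c(1, 3, 5)  # the odd x values kept by filter(x %% 2 != 0)
Reduce(`-`, c(z, x_odd[-1] - 1), accumulate = TRUE)  # 10 8 4
```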

library(tidyverse)

df2 <- df %>% 
  filter(x %% 2 != 0) %>%
  mutate(z = accumulate(c(z, x[-1] - 1), `-`)) %>%
  right_join(df, by = c("x", "y")) %>%
  mutate(z = lag(z), z = ifelse(is.na(z), 0, z)) %>%
  mutate(y = cumsum(z)) %>%
  select(-z)
df2
#   x  y
# 1 1  0
# 2 2 10
# 3 3 10
# 4 4 18
# 5 5 18
www
  • one issue I have is that the df is very large so making a copy of it feels like it will be very slow. I was looking into the purrr package and will explore that more. – George Feb 09 '18 at 01:41