1

I am trying to add a list column to a tibble data frame. The resulting list column is calculated from two columns contained in the data frame and a vector which is external / independent.

Suppose that the data frame and the vector are the following:

library(dplyr)
library(magrittr)
dat <- tibble(A = c(12, 27, 22, 1, 15, 30, 20, 28, 19),
              B = c(68, 46, 69, 7, 44, 76, 72, 50, 51))

vec <- c(12, 25, 28, 58, 98)

Now, I would like to add (mutate) the column y so that for each row y is a list containing the elements of vec between A and B (inclusive).

The not-so-proper way to do this would be via loop. I initialize the column y as list and update it row-wise based on the condition A <= vec & vec <= B:

dat %<>% 
  mutate(y = list(vec))

for (i in 1:nrow(dat)){
  dat[i,]$y[[1]] <- (vec[dat[i,]$A <= vec &  vec <= dat[i,]$B])
}

The result is a data frame with y being a list of dbl of variable length:

> dat
# A tibble: 9 x 3
      A     B y        
  <dbl> <dbl> <list>   
1    12    68 <dbl [4]>
2    27    46 <dbl [1]>
3    22    69 <dbl [3]>
4     1     7 <dbl [0]>
5    15    44 <dbl [2]>
6    30    76 <dbl [1]>
7    20    72 <dbl [3]>
8    28    50 <dbl [1]>
9    19    51 <dbl [2]>

The first four values of y are:

[[1]]
[1] 12 25 28 58

[[2]]
[1] 28

[[3]]
[1] 25 28 58

[[4]]
numeric(0)

Note: the 4-th list is empty, because no value of vec is between A=1 and B=7.

I have tried as an intermediate step with getting the subscripts via which using mutate(y = list(which(A <= vec & vec <= B))) or with a combination of seq and %in%, for instance mutate(y = list(vec %in% seq(A, B))). These both give an error. However, I don't need the subscripts, I need a subset of vec.

2 Answers2

3

Create a small helper function with the logic that you want to implement.

return_values_in_between <- function(vec, A, B) {
  vec[A <= vec & vec <= B]
}

and call the function for each row (using rowwise) -

library(dplyr)

result <- dat %>%
  rowwise() %>%
  mutate(y = list(return_values_in_between(vec, A, B))) %>%
  ungroup()

result

# A tibble: 9 × 3
#      A     B    y        
#   <dbl> <dbl> <list>   
#1    12    68 <dbl [4]>
#2    27    46 <dbl [1]>
#3    22    69 <dbl [3]>
#4     1     7 <dbl [0]>
#5    15    44 <dbl [2]>
#6    30    76 <dbl [1]>
#7    20    72 <dbl [3]>
#8    28    50 <dbl [1]>
#9    19    51 <dbl [2]>

Checking the first 4 values in result$y -

result$y
#[[1]]
#[1] 12 25 28 58

#[[2]]
#[1] 28

#[[3]]
#[1] 25 28 58

#[[4]]
#numeric(0)

#...
#...
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Problem is that this solution won't work if the data frame also contains a column called `vec`. Wondering how it would look like if I explicitely tell R to look for `vec` outside of the data frame. the !! operator doesn't work. – deschen Sep 27 '22 at 10:56
  • 1
    The best practice would be to name the vector and dataframe column name differently. However, if that is not possible you may use `.GlobalEnv$vec` to use `vec` as vector and not dataframe column name. – Ronak Shah Sep 27 '22 at 11:00
  • Agree, still wondering how this could look like in the tidyverse way, i.e. where I don't create a dedicated function, but just go diorectly with `dat |> rowwise() |> mutate(y = list(vec[vec >= A & vec <= B])) |> ungroup()`. Any idea (out of curiosity)? – deschen Sep 27 '22 at 11:07
  • Use `.GlobalEnv$vec` instead of `vec` everywhere. In my answer it would look like `return_values_in_between(.GlobalEnv$vec, A, B)` – Ronak Shah Sep 27 '22 at 11:11
  • Right, that works if you are using a dedicated function. But wondering how it would look like when NOT using a function and instead trying to code it directly into the `mutate` as shown in my example. Any idea? – deschen Sep 27 '22 at 11:13
  • 1
    @deschen `dat |> rowwise() |> mutate(y = list(.GlobalEnv$vec[.GlobalEnv$vec >= A & .GlobalEnv$vec <= B])) |> ungroup()` – Ronak Shah Sep 27 '22 at 11:15
  • Ah, now I get it, tahnks – deschen Sep 27 '22 at 11:15
  • In dplyr, can also use `.env$vec` https://rlang.r-lib.org/reference/dot-data.html – IceCreamToucan Sep 27 '22 at 12:08
0

With the help of @Ronak Shah, I was able to come up with a solution that doesn't require a dedicated function and also makes sure that the vec is pulled from the global environment (in case there might be a column vec in the data frame):

library(tidyverse)
dat |> 
  rowwise() |> 
  mutate(y = list(.GlobalEnv$vec[.GlobalEnv$vec >= A & .GlobalEnv$vec <= B])) |> 
  ungroup()
deschen
  • 10,012
  • 3
  • 27
  • 50