R iterating by group and mapping values based on column value

Question

I have the following data frame in R:

df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
                 time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
                 target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
                 comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))

And another data frame:

data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
                   name = c('a','b','c','d','e','f','g','h','i','j'))

So, if we take a look at df we could sort the data by target and combination and we will notice that there are basically "groups". For example for target=1 and comb=0 there are four entries p1_start,p1_end,p2_start,p2_end and it is the same for all other target/comb combinations.
On the other side data contains entries with time being a timestamp.

Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2 meaning it happened between p1_start,p1_end so it should get the values target=1 and comb=0 mapped to the data data frame.
Example 2: The entries of data with time=14 happened between p2_start,p2_end so they should get the values target=1 and comb=1 mapped to the data data frame.

Idea: I thought I iterate over df by target and comb and for each combination of them check if there are rows in data whose time is between. The second could be done with the following command:

data[which(data$time > p1_start & data$time < p2_end),]

once I get the rows it is easy to append the values. Problem: how could I do the iteration? I tried with the following:

df %>% 
group_by(target, comb) %>% 
print(data[which(data$time > df$p1_start & data$time < df$p2_end),])

But I am getting an error that time has not been initialized

ekoam · Answer 1 · 2022-01-11T09:43:06.817

Your problem is best known as performing non-equi join. We need to find a range in some given dataframe that corresponds to each value in one or more given vectors. This is better handled by the data.table package.

We would first transform your df into a format suitable for performing the join and then join data with df by time <= end while time >= start. Here is the code

library(data.table)

setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
df <- dcast(df, ... ~ name, value.var = "time")

cols <- c("target", "comb", "type")
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time<=end, time>=start)]

After dcast, df looks like this

   target comb type end start
1:      1    0   p1   3     1
2:      1    0   p2   7     5
3:      1    1   p1  11     9
4:      1    1   p2  15    13
5:      2    0   p1  19    17
6:      2    0   p2  23    21
7:      2    1   p1  27    25
8:      2    1   p2  31    29

And the output is

> data
    time name target comb type
 1:    2    a      1    0   p1
 2:    5    b      1    0   p2
 3:    8    c     NA   NA <NA>
 4:   14    d      1    1   p2
 5:   14    e      1    1   p2
 6:   20    f     NA   NA <NA>
 7:   21    g      2    0   p2
 8:   26    h      2    1   p1
 9:   28    i     NA   NA <NA>
10:   28    j     NA   NA <NA>

score 0 · Answer 2 · answered Jan 11 '22 at 10:19

Here is a tidyverse solution:

library(tidyr)
library(dplyr)
df %>% 
  rename(name_df=name) %>% 
  mutate(x = time +1) %>% 
  pivot_longer(
    cols = c(time, x), 
    names_to = "helper",
    values_to = "time"
  ) %>% 
  right_join(data, by="time") %>% 
  select(time, name, target, comb)

    time name  target  comb
   <dbl> <chr>  <dbl> <dbl>
 1     2 a          1     0
 2     5 b          1     0
 3     8 c          1     0
 4    14 d          1     1
 5    14 e          1     1
 6    20 f          2     0
 7    21 g          2     0
 8    26 h          2     1
 9    28 i          2     1
10    28 j          2     1

score 0 · Answer 3 · answered Jan 11 '22 at 13:16

df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
                 time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
                 target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
                 comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))

data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
                   name = c('a','b','c','d','e','f','g','h','i','j'))

library(fuzzyjoin)
library(tidyverse)

tmp <- df %>%
  separate(name,
           into = c("p", "period"),
           sep = "_",
           remove = TRUE) %>%
  pivot_wider(
    id_cols = c(p, target, comb),
    names_from = period,
    values_from = time
  ) %>%
  select(-p)

fuzzy_left_join(
  x = data, 
  y = tmp, 
  by = c("time" = "start", 
         "time" = "end"), 
  match_fun = list(`>=`, `<=`))
#>    time name target comb start end
#> 1     2    a      1    0     1   3
#> 2     5    b      1    0     5   7
#> 3     8    c     NA   NA    NA  NA
#> 4    14    d      1    1    13  15
#> 5    14    e      1    1    13  15
#> 6    20    f     NA   NA    NA  NA
#> 7    21    g      2    0    21  23
#> 8    26    h      2    1    25  27
#> 9    28    i     NA   NA    NA  NA
#> 10   28    j     NA   NA    NA  NA

^{Created on 2022-01-11 by the reprex package (v2.0.1)}

R iterating by group and mapping values based on column value

3 Answers3