Find closest value between vectors in data frame by group

Question

I have a dataset df:

df <- tibble(
  id = sort(rep(letters[1:3], 3)),
  visit_id = rep(c(0, 5, 10), 3),
  true_visit = c(NA, 3, NA, 0, 5, 10, 1, 7, NA)
)
> df
# A tibble: 9 x 3
  id    visit_id true_visit
  <chr>    <dbl>      <dbl>
1 a            0         NA
2 a            5          3
3 a           10         NA
4 b            0          0
5 b            5          5
6 b           10         10
7 c            0          1
8 c            5          7
9 c           10         NA

I’m trying to create a new column closest_visit where I find the true_visit that is closest to visit_id within each individual. The result would look like:

# A tibble: 9 x 4
  id    visit_id true_visit closest_visit
  <chr>    <dbl>      <dbl>         <dbl>
1 a            0         NA             3
2 a            5          3             3
3 a           10         NA             3
4 b            0          0             0
5 b            5          5             5
6 b           10         10            10
7 c            0          1             1
8 c            5          7             7
9 c           10         NA             7

To clarify, closest_visit is 3 for individual a because it's the only true_visit. closest_visit is 1 for the seventh row because 0 (the visit_id for that row) is closer to 1 than it is to 7 (the true_visits for that participant), and so on.

I tried looking here, here, and here. They were a good start but not exactly what I'm looking for. Any ideas?

Why is the `closest_visit` for the seventh row 1? Also, why are there 3 `closest_visit` values per `id b`? — tmfmnk, Nov 01 '19 at 17:55
@tmfmnk for the seventh row (id `c`), `abs(0 - c(1,7)) = c(1, 7)`, the smallest of which is `1`, which corresponds to `true_visit == 1`. For id `b`, each of their true visits matches a different visit ID, so for example `abs(5 - c(0,5,10)) = c(5, 0, 5)`, the smallest of which is `0`, which corresponds to `true_visit == 5`. Does that make sense/answer your question? — Felix T., Nov 01 '19 at 18:07
Not really. Isn't it about finding one closest value per ID? Also, why is there a 7 for the ninth row? — tmfmnk, Nov 01 '19 at 18:11
There's a 7 in the ninth row because 10 (the visit id for that row) is closer to 7 than it is to 1 (the true visits for that individual) — Felix T., Nov 01 '19 at 18:12
Try `df %>% group_by(id) %>% mutate(closest_visit = na.omit(true_visit)[findInterval(true_visit, visit_id)]) %>% fill(closest_visit, .direction = "updown")` — akrun, Nov 01 '19 at 18:24

score 1 · Answer 1 · answered Nov 01 '19 at 18:25

One option is findInterval and then fill

library(dplyr)
library(tidyr)
df %>%
   group_by(id) %>% 
   mutate(closest_visit = na.omit(true_visit)[findInterval(true_visit, 
          visit_id)]) %>% 
   fill(closest_visit, .direction = "updown")
# A tibble: 9 x 4
# Groups:   id [3]
#  id    visit_id true_visit closest_visit
#  <chr>    <dbl>      <dbl>         <dbl>
#1 a            0         NA             3
#2 a            5          3             3
#3 a           10         NA             3
#4 b            0          0             0
#5 b            5          5             5
#6 b           10         10            10
#7 c            0          1             1
#8 c            5          7             7
#9 c           10         NA             7

Thank you! Would you mind telling me what version of tidyr you are using? I'm using tidyr version 0.8.3 and the only options for `.direction` are `c("up", "down")`. — Felix T., Nov 01 '19 at 18:32
@FelixT. I used `packageVersion('dplyr') [1] ‘0.8.3’ > packageVersion('tidyr') [1] ‘1.0.0’` — akrun, Nov 01 '19 at 18:35

score 1 · Answer 2 · answered Nov 01 '19 at 18:34

This isn't the most pretty way, but it works on your example:

library(dplyr)

for (id in unique(df$id) ) {

  available_visit = na.omit(df[df$id == id ,'true_visit']) %>% pull()
  unique_id = unique(df$visit_id[df$id == id])

    for (visit_id in unique_id) {

      df[df$id == id & df$visit_id ==  visit_id, 'closest_visit' ] <- 
        available_visit[which.min(abs(available_visit-visit_id))]
      }
  }

score 1 · Accepted Answer · answered Nov 01 '19 at 18:36

Could go for:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(
    closest_visit = case_when(
      visit_id == true_visit ~ true_visit,
      TRUE ~ true_visit[sapply(visit_id, 
                               function(x) which.min(abs(x - true_visit)))]
      )
    )

Output:

# A tibble: 9 x 4
# Groups:   id [3]
  id    visit_id true_visit closest_visit
  <chr>    <dbl>      <dbl>         <dbl>
1 a            0         NA             3
2 a            5          3             3
3 a           10         NA             3
4 b            0          0             0
5 b            5          5             5
6 b           10         10            10
7 c            0          1             1
8 c            5          7             7
9 c           10         NA             7

Find closest value between vectors in data frame by group

3 Answers3