Coalesce pairs of variables within a dataframe based on a regular expression

Question

I want to use dplyr::coalesce to find the first non-missing value between pairs of variables in a dataframe containing multiple pairs of variable. The goal is to create a new dataframe with now only one copy for each pair of variable (a coalesce variable without NA values).

Here is an example:

df <- data.frame(
      A_1=c(NA, NA, 3, 4, 5),
      A_2=c(1, 2, NA, NA, NA),
      B_1=c(NA, NA, 13, 14, 15),
      B_2=c(11, 12, NA, NA, NA))


Expected output: 

A  B
1  11
2  12
3  13
4  14
5  15

I am guessing a mix of dplyr::coalesce with maybe dplyr::mutate_at based on regular expression could be use but I am not sure how to do it. Is there a way to complete this task with the tidyverse grammar?

Thanks!

EDIT: thanks everyone for your answers! However, I should have included the naming convention for my variables to facilitate the transfer of your answers to my actual problem.. I am sorry about that. My variables are geochemistry variables named in two parts (name of chemical element plus name of core).

Example: Al_TAC4.25.275 where Al is the element and TAC4.25.275 is the core. I want to coalesce the data from 3 different cores (second part of name) for each element (first part of name). I have 25 pairs of element to coalesce.

jared_mamrot · Accepted Answer · 2021-12-22T04:40:45.060

You could use transmute, e.g.

library(dplyr)

df <- data.frame(
  A_1 = c(NA, NA, 3, 4, 5),
  A_2 = c(1, 2, NA, NA, NA),
  B_1 = c(NA, NA, 13, 14, 15),
  B_2 = c(11, 12, NA, NA, NA)
  )

df %>%
  transmute(A = coalesce(A_1, A_2),
            B = coalesce(B_1, B_2))
#>   A  B
#> 1 1 11
#> 2 2 12
#> 3 3 13
#> 4 4 14
#> 5 5 15

^{Created on 2021-12-22 by the reprex package (v2.0.1)}

Another option, if you have lots of "A_*" and "B_*" columns (source: Romain François, user: @Romain Francois):

library(dplyr)

df <- data.frame(
  A_1 = c(NA, NA, 3, 4, 5),
  A_2 = c(1, 2, NA, NA, NA),
  B_1 = c(NA, NA, 13, 14, 15),
  B_2 = c(11, 12, NA, NA, NA)
  )

coacross <- function(...) {
  coalesce(!!!across(...))
}

df %>%
  transmute(A = coacross(starts_with("A_")),
            B = coacross(starts_with("B_")))
#>   A  B
#> 1 1 11
#> 2 2 12
#> 3 3 13
#> 4 4 14
#> 5 5 15

^{Created on 2021-12-22 by the reprex package (v2.0.1)}

Edit

Based on your updated question, you don't have lots of "A_*" or "B_*" columns, but instead lots of "*_1", "*_2", and "*_3" columns. I think this is the most straightforward solution for your use-case:

library(dplyr)

df <- data.frame(Al_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
                 Al_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
                 Al_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3),
                 Au_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
                 Au_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
                 Au_TAC4.25.277 = c(NA, NA, 3, NA, NA, NA),
                 Ar_TAC4.25.275 = c(1, 1, 1, NA, NA, 1),
                 Ar_TAC4.25.276 = c(NA, NA, 2, 2, 2, 2),
                 Ar_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3))

df
#>   Al_TAC4.25.275 Al_TAC4.25.276 Al_TAC4.25.277 Au_TAC4.25.275 Au_TAC4.25.276
#> 1              1             NA             NA              1             NA
#> 2              1             NA             NA              1             NA
#> 3              1              2              3              1              2
#> 4             NA              2             NA             NA              2
#> 5             NA              2             NA             NA              2
#> 6             NA             NA              3             NA             NA
#>   Au_TAC4.25.277 Ar_TAC4.25.275 Ar_TAC4.25.276 Ar_TAC4.25.277
#> 1             NA              1             NA             NA
#> 2             NA              1             NA             NA
#> 3              3              1              2              3
#> 4             NA             NA              2             NA
#> 5             NA             NA              2             NA
#> 6             NA              1              2              3

names(df) %>% 
  split(str_extract(., '[:alpha:]+')) %>%
  map_dfc(~ coalesce(!!!df[.x][c(1,2,3)]))
#> # A tibble: 6 × 3
#>      Al    Ar    Au
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1
#> 2     1     1     1
#> 3     1     1     1
#> 4     2     2     2
#> 5     2     2     2
#> 6     3     1    NA

# change the order of the list to change the 'priority'
names(df) %>% 
  split(str_extract(., '[:alpha:]+')) %>%
  map_dfc(~ coalesce(!!!df[.x][c(3,2,1)]))
#> # A tibble: 6 × 3
#>      Al    Ar    Au
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1
#> 2     1     1     1
#> 3     3     3     3
#> 4     2     2     2
#> 5     2     2     2
#> 6     3     3    NA

names(df) %>% 
  split(str_extract(., '[:alpha:]+')) %>%
  map_dfc(~ coalesce(!!!df[.x][c(2,1,3)]))
#> # A tibble: 6 × 3
#>      Al    Ar    Au
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1
#> 2     1     1     1
#> 3     2     2     2
#> 4     2     2     2
#> 5     2     2     2
#> 6     3     2    NA

^{Created on 2021-12-22 by the reprex package (v2.0.1)}

You could: `df %>% transmute(across(ends_with("_1"), ~coalesce(., get(str_replace(cur_column(), '1$', '2'))), .names = "{.col}_coalesce")) %>% rename_at(vars(ends_with('coalesce')), ~ str_remove(., "\\_1"))` to avoid the function?! or to avoid `coalesce` twice. — TarJae, Dec 21 '21 at 23:18
Yeah! I tried this approach but couldn't figure out the fine details - thanks @TarJae! I think it would be worth adding that to your answer :) — jared_mamrot, Dec 21 '21 at 23:55
This works wonderfully thanks :) But if I may ask for some more clarification with respect to my problem: I have about 25 pairs or variable (actually, each 25 variables is repeated 3 times; example: there are variables `A_1`, `A_2`, `A_3`...`Z_1`, `Z_2`, `Z_3` and I want to coalesce to `A`,... `Z` as to reduce the number of variable. Is there a way to generalize your code to avoid specifying `A = ...`, `B = ....` ? Secondly, is there a way to built in a priority in selecting the variable to infill the coalesce variable? For example if I want in priority to have `_2` over `_1` as infill? — Antoine Lachance, Dec 22 '21 at 00:36
I've updated my answer with a potential solution @AntoineLachance — jared_mamrot, Dec 22 '21 at 04:45

Anoushiravan R · Answer 2 · 2022-10-11T22:06:36.047

Here is another more concise solution, comparing to my other one. I think the use of cur_data() function is very helpful but you could also use across(everything()) in its place:

library(dplyr)
library(purrr)

unique(sub("(\\D)_\\d+", "\\1", names(df))) %>%
  map_dfc(~ df %>%
            select(starts_with(.x)) %>%
             summarise(!!.x := do.call(coalesce, cur_data())))

  A  B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

Here is another solution for as many pairs as possible. Just note that I used bang bang operator (!!!) in order to collapse elements of a data frame into standalone single arguments so that I could apply coalesce on them:

library(dplyr)
library(rlang)

as.data.frame(do.call(cbind, lapply(split.default(df, sub("(\\D)_\\d+", "\\1", names(df))), function(x) {
  coalesce(!!!x)
})))

  A  B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

Fell very happy to see you here! – TarJae Dec 21 '21 at 22:57 — TarJae, Dec 21 '21 at 22:57
It is always my pleasure brother :) – Anoushiravan R Dec 21 '21 at 22:58 — Anoushiravan R, Dec 21 '21 at 22:58

score 5 · Answer 3 · answered Dec 21 '21 at 22:49

5

A base R option

list2DF(
  lapply(
    split.default(df, gsub("_.*", "", names(df))),
    rowSums,
    na.rm = TRUE
  )
)

gives

answered Dec 21 '21 at 22:49

ThomasIsCoding

96,636
9
24
81

2

base R brothers! you know I am enamored by `igraph` and network analysis. It totally changed my perspective on how I look at things :) – Anoushiravan R Dec 21 '21 at 22:59
2

@AnoushiravanR Good to hear that. Enjoy your `igraph` trip :) – ThomasIsCoding Dec 21 '21 at 23:01

score 4 · Answer 4 · answered Dec 21 '21 at 22:56

Here is an alternative with pivoting:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(
    everything()
  ) %>% 
  mutate(name = substr(name, 1, 1)) %>% 
  na.omit %>% 
  pivot_wider(
    names_from = name,
    values_from = value,
    values_fn = list
  ) %>% 
  unnest(cols = c(A, B))

      A     B
  <dbl> <dbl>
1     1    11
2     2    12
3     3    13
4     4    14
5     5    15

kybazzi · Answer 5 · 2021-12-22T03:02:27.330

Edit: I believe this solution continues to work even after your edit. It works regardless of the number of elements or the number of cores per element. You just need to make sure things are named consistently, in the form "{element}_{core}".

library(tidyverse)
df %>% 
  mutate(id = 1:n()) %>% 
  pivot_longer(-id) %>% 
  filter(!is.na(value)) %>% 
  mutate(variable = str_extract(name, "^[^_]+")) %>% 
  group_by(id, variable) %>% 
  # Arrange by name (e.g. A_1) so that we could select the first non-NA
  arrange(name) %>% 
  summarise(value = value[1]) %>% 
  pivot_wider(names_from = "variable")

Output

# A tibble: 5 x 3
     id     A     B
  <int> <dbl> <dbl>
1     1     1    11
2     2     2    12
3     3     3    13
4     4     4    14
5     5     5    15

score 2 · Answer 6 · answered Dec 21 '21 at 22:48

I have asked for it here: https://github.com/tidyverse/dplyr/issues/6109 where there are some possible solutions. For example

library(dplyr)
library(purrr)
df %>% 
    transmute(map2_dfc(.x = across(ends_with("_1"), .names = '{sub("_1","",.col)}'), 
                    .y = across(ends_with("_2")), 
                    .f = coalesce))
  A  B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

Or also using the function

coalesce_prefix <- function(prefix) {
  exprs <- map(prefix, function(p) {
    expr(coalesce(
      !!sym(paste0(p, ".x")),
      !!sym(paste0(p, ".y"))
    ))
  })
  names(exprs) <- prefix
  exprs
}

Coalesce pairs of variables within a dataframe based on a regular expression

6 Answers6

Edit