How to group columns together based on commonality in name, create a combined new column, eliminating NAs?

Question

I have some data similar to this example data, where columns stand for years:

data <- data.frame(
  X2020_1 = c("A", NA, "B"),
  X2020_2 = c("A", "C", NA),
  X2021_1 = c("A", NA, "C"),
  X2021_2 = c(NA, NA, "A")
)

which looks like this:

  X2020_1 X2020_2 X2021_1 X2021_2
1       A       A       A    <NA>
2    <NA>       C    <NA>    <NA>
3       B    <NA>       C       A

I want to group the data by columns based on the column name. I want anything with X2020 in its name to be in a new X2020 column, and anything with X2021 to be in a new X2021 column. I also want to eliminate NA values unless there are only NAs in that column-group.

My desired end result data should look like this:

    X2020   X2021
1 c(A, A)       A
2       C    <NA>
3       B c(C, A)

I'm not really sure how to get there.

score 2 · Accepted Answer · answered Jun 25 '23 at 01:56

I have two solutions, the first one is more manual and the second one is more dynamic.

unite()

From the tidyr package, there's an unite() function that allows you to combine multiple columns into a single one. But that requires one unite function per pattern (or year in this case).

library(tidyverse)

data %>% 
  unite("X2020", starts_with("X2020"), na.rm = T, sep = ", ") %>% 
  unite("X2021", starts_with("X2021"), na.rm = T, sep = ", ") %>% 
  mutate(across(everything(), ~ifelse(.x == "", NA, .x)))

Reshape data

For the second solution, you can first reshape your data into a "long" format, then summarise the column per row number and "year" to concatenate the values, and reshape back to "wide" format.

This is more dynamic and do not require hard-coding patterns.

library(tidyverse)

data %>% 
  mutate(rn = row_number()) %>% 
  pivot_longer(-rn) %>% 
  mutate(name = sub("_\\d", "", name)) %>% 
  group_by(name, rn) %>% 
  summarize(value = toString(na.omit(value))) %>%
  pivot_wider() %>% 
  select(-rn) %>% 
  mutate(across(everything(), ~ifelse(.x == "", NA, .x)))

Output

Both solutions give the same output.

# A tibble: 3 × 2
  X2020 X2021
  <chr> <chr>
1 A, A  A    
2 C     NA   
3 B     C, A

The first solution is so elegant. Def gonna add `unite()` to my toolbox. — Harry Smith, Jun 25 '23 at 02:04

How to group columns together based on commonality in name, create a combined new column, eliminating NAs?

1 Answers1

unite()

Reshape data

Output