How to convert a wide format data with multiple categorical variables to long format in R?

Question

Here is my initial data.

df <- data.frame(v1 = c("a", "a", "a", "a", "a", "a"),
                 v5 = c("c1", "c2", NA, "c1", "c2", NA),
                 v6 = c(20, 30, NA, 14, 26, NA),
                 v7 = c("d1", "d2", "d3", "d1", "d2", "d3"),
                 v8 = c(2, 3, 4, 5, 6, 7))

  v1   v5   v6 v7 v8
1  a   c1   20 d1  2
2  a   c2   30 d2  3
3  a <NA> <NA> d3  4
4  a   c1   14 d1  5
5  a   c2   26 d2  6
6  a <NA> <NA> d3  7

I'd like to convert it to one with long format like below.

    v1 Variable Var_category var_value
1   a       v5           c1        20
2   a       v5           c2        30
3   a       v7           d1         2
4   a       v7           d2         3
5   a       v7           d3         4
6   a       v5           c1        14
7   a       v5           c2        26
8   a       v6           d1         5
9   a       v6           d2         6
10  a       v6           d3         7

How can I do this in R?

I tried using pivot_longer() and gather() but could not find the correct way to do this since it's not either traditional long format or traditional wide format I guess.

score 2 · Answer 1 · answered Mar 30 '23 at 20:19

Here's an example using pivot_longer(), but you it's easiest if you rename the variables so the stems correspond with categories and values:

library(dplyr)
library(tidyr)
df <- data.frame(v1 = c("a", "a", "a", "a", "a", "a"),
                 v5 = c("c1", "c2", NA, "c1", "c2", NA),
                 v6 = c(20, 30, NA, 14, 26, NA),
                 v7 = c("d1", "d2", "d3", "d1", "d2", "d3"),
                 v8 = c(2, 3, 4, 5, 6, 7))


df %>% 
  rename(cat_v5 = v5, cat_v7 = v7, vals_v5 = v6, vals_v7 = v8) %>%
  pivot_longer(-v1, 
               names_pattern=c("(.*)_(.*)"), 
               names_to = c(".value", "category")) %>% 
  na.omit()
#> # A tibble: 10 × 4
#>    v1    category cat    vals
#>    <chr> <chr>    <chr> <dbl>
#>  1 a     v5       c1       20
#>  2 a     v7       d1        2
#>  3 a     v5       c2       30
#>  4 a     v7       d2        3
#>  5 a     v7       d3        4
#>  6 a     v5       c1       14
#>  7 a     v7       d1        5
#>  8 a     v5       c2       26
#>  9 a     v7       d2        6
#> 10 a     v7       d3        7

^{Created on 2023-03-30 with reprex v2.0.2}

score 2 · Answer 2 · answered Mar 30 '23 at 21:50

Base R reshape deals with this slightly unique circumstance pretty flexibly, if you make yourself a list showing what variables go to what values:

vars <- list(Var_category=c("v5","v7"), Var_value=c("v6","v8"))
out <- na.omit(reshape(df, varying=vars, v.names=names(vars), direction="long",
                       times=vars[[1]], timevar="Variable"))
out

##     v1 Variable Var_category Var_value id
##1.v5  a       v5           c1        20  1
##2.v5  a       v5           c2        30  2
##4.v5  a       v5           c1        14  4
##5.v5  a       v5           c2        26  5
##1.v7  a       v7           d1         2  1
##2.v7  a       v7           d2         3  2
##3.v7  a       v7           d3         4  3
##4.v7  a       v7           d1         5  4
##5.v7  a       v7           d2         6  5
##6.v7  a       v7           d3         7  6

Bonus is if you want to revert back to wide including all the naming, just use the magic:

reshape(out)
##     v1 id   v5 v6 v7 v8
##1.v5  a  1   c1 20 d1  2
##2.v5  a  2   c2 30 d2  3
##4.v5  a  4   c1 14 d1  5
##5.v5  a  5   c2 26 d2  6
##3.v7  a  3 <NA> NA d3  4
##6.v7  a  6 <NA> NA d3  7

score 1 · Answer 3 · answered Mar 30 '23 at 20:18

In this case you can simply rbind columns of the data frame

df <- data.frame(v1 = c("a", "a", "a", "a", "a", "a"),
                 v5 = c("c1", "c2", NA, "c1", "c2", NA),
                 v6 = c(20, 30, NA, 14, 26, NA),
                 v7 = c("d1", "d2", "d3", "d1", "d2", "d3"),
                 v8 = c(2, 3, 4, 5, 6, 7))
na.omit(
  rbind(
    with(df, data.frame(v1 = v1, Variable = "v5", Var_category=v5, var_value=v6)), 
    with(df, data.frame(v1 = v1, Variable = "v7", Var_category=v7, var_value=v8))) )
#>    v1 Variable Var_category var_value
#> 1   a       v5           c1        20
#> 2   a       v5           c2        30
#> 4   a       v5           c1        14
#> 5   a       v5           c2        26
#> 7   a       v7           d1         2
#> 8   a       v7           d2         3
#> 9   a       v7           d3         4
#> 10  a       v7           d1         5
#> 11  a       v7           d2         6
#> 12  a       v7           d3         7

Thanks for the suggestion! The issue is my actual data includes much more variables than the one I have here, I tried to find a more general way to do this. But if I cannot find the appropriate way eventually, I might need to restructure it using rbind — Joy, Apr 03 '23 at 14:15

How to convert a wide format data with multiple categorical variables to long format in R?

3 Answers3