tidyr::pivot_longer to merge 2 groups of columns, each group with a different name prefix

Question

I have a dataframe with deforestation data for 543 survey sites. It has 20 columns for the values in 2001-2020 (X1, X2 etc) and another 20 columns with a measure of population density for the same years (columns pop01, pop02 etc).

> str(grid10b )
'data.frame':   543 obs. of  45 variables:
 $ X1       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X2       : num  0.000889 0.000119 0.002048 0.00066 0.003605 ...
 $ X3       : num  0.004645 0.000612 0.007276 0.002608 0.003475 ...
 $ X4       : num  6.70e-04 8.07e-05 1.99e-03 1.19e-03 1.89e-03 ...
 $ X5       : num  0.001447 0.000183 0.00314 0.001687 0 ...
 $ X6       : num  0.000659 0.000115 0.002078 0.001113 0.000869 ...

...etc. I can merge the deforestation columns (thanks to the answer here: Reshaping longitudinal dataset with tmerge or SurvSplit?)

The code so far is:

grid10a <- grid10a %>%
  tidyr::pivot_longer(cols = starts_with('X'), values_to = 'def') %>%
  group_by(id) %>%
  mutate(tstart = row_number(),
         tstop = tstart+1) %>%
  select(-name) # otherwise there's a column with X1, X2 etc which isn't needed

...this merges the 20 columns with deforestation values into a single column 'def' and gives me 20 rows for each site ID. So far so good.

But how can I merge the population density columns? I just need to add these into a 'population' column as they're in the same year order as the values I just tidied. I need to line up the values of X1 and pop01, X2 and pop02 and so on.

I tried this next:

grid10c <- grid10b %>%
  tidyr::pivot_longer(cols = starts_with('pop'), values_to = 'popn') %>% group_by(id2)

...but ended up with a dataframe of 228,060 rows! The solution must be something like the first answer here: Reshaping multiple sets of measurement columns (wide format) into single columns (long format)

...but the use of 'names_to' and 'names_sep' isn't really explained.

Here's a dummy example of the sort of datafrae structure I have (df1) and the sort I want to build(df2):

df1 <- data.frame(ID = seq(1, 543),
                  X1 = runif(543, 0, 1),
                  X2 = runif(543, 0, 1),
                  X3 = runif(543, 0, 1),
                  X4 = runif(543, 0, 1),
                  X5 = runif(543, 0, 1),
                  X6 = runif(543, 0, 1),
                  X7 = runif(543, 0, 1),
                  X8 = runif(543, 0, 1),
                  X9 = runif(543, 0, 1),
                  X10 = runif(543, 0, 1),
                  X11 = runif(543, 0, 1),
                  X12 = runif(543, 0, 1),
                  X13 = runif(543, 0, 1),
                  X14 = runif(543, 0, 1),
                  X15 = runif(543, 0, 1),
                  X16 = runif(543, 0, 1),
                  X17 = runif(543, 0, 1),
                  X18 = runif(543, 0, 1),
                  X19 = runif(543, 0, 1),
                  X20 = runif(543, 0, 1),
                  pop01 = runif(543, 0, 100),
                  pop02 = runif(543, 0, 100),
                  pop03 = runif(543, 0, 100),
                  pop04 = runif(543, 0, 100),
                  pop05 = runif(543, 0, 100),
                  pop06 = runif(543, 0, 100),
                  pop07 = runif(543, 0, 100),
                  pop08 = runif(543, 0, 100),
                  pop09 = runif(543, 0, 100),
                  pop10 = runif(543, 0, 100),
                  pop11 = runif(543, 0, 100),
                  pop12 = runif(543, 0, 100),
                  pop13 = runif(543, 0, 100),
                  pop14 = runif(543, 0, 100),
                  pop15 = runif(543, 0, 100),
                  pop16 = runif(543, 0, 100),
                  pop17 = runif(543, 0, 100),
                  pop18 = runif(543, 0, 100),
                  pop19 = runif(543, 0, 100),
                  pop20 = runif(543, 0, 100))
df2 <- data.frame(ID = rep(1:543,each = 20),
                  def = runif(10860, 0, 1),
                  popn = runif(10860 , 0, 100))

I have a dataframe with 41 columns and 543 rows. One column is a site ID column. 20 columns are the deforestation in years 2000-2020, 20 columns are the population density in years 2000-2020. I'm trying to get a dataframe of 543x20 = 10,860 rows, one row per site per year - and with 3 columns; ID, deforestation and population. — Michael Smith, Jul 07 '21 at 17:21
Can you upload the following data, or appropriate example data: `dput(tail(your_dataframe, 20))`? — mhovd, Jul 07 '21 at 17:51
Did you see the last example in the documentation of `pivot_longer()` using the `anscombe` dataset? That sounds similar to what you are trying to do and might at least get you started. :) — aosmith, Jul 07 '21 at 17:58
@mhovd I've tried to add example data, at least it might demonstrate the shape of the dataframe. Let me know if that helps. — Michael Smith, Jul 07 '21 at 19:06
@aosmith I don't fully understand the anscombe example but if I applied it to my dataframe, wouldn't it give me 20 x 20 = 400 rows for each ID? — Michael Smith, Jul 07 '21 at 19:06
@MichaelSmith I ran your code on your sample dataset, and got a `grid10c` with 217,200 rows: equal to 543 × 20 × 20. That is, 20 × 20 = 400 rows for each of the 543 `ID`s. Your number 228,060 appears to be the product 543 × 20 × **21**. Was there exactly 1 extra column that got pivoted, along with the 20 that were intended, to yield 543 × 20 × **(1 + 20)** = 228,060? — Greg, Jul 07 '21 at 19:35
@MichaelSmith Also, if you want to match up each year with its corresponding **pair** of values `X` | `pop`, you'll need to [**pivot all at once**](https://stackoverflow.com/a/68292159). Otherwise, you'll pivot once to turn each row (of the form `X1` | `X2` | `...` | `X20` | `pop01` | `pop02` | `...` | `pop20`) into 20 new rows (of the form `X` | `pop01` | `pop02` | `...` | `pop20`) corresponding to the 20 values of `X`; and then you'll pivot once **again** to turn each of **those** rows into 20 **new** rows (of the form `X` | `pop`) corresponding to the 20 values of `pop`. — Greg, Jul 07 '21 at 19:50
@MichaelSmith Put another way, by doing the pivots sequentially rather than [all at once](https://stackoverflow.com/a/68292159), you end up getting **(1)** a particular `ID` of (say) `1` and its `X` value of (here) `X1`, combined into **20 rows** with the 20 `pop` values from {`pop01`, `pop02`, ..., `pop20`}; rather than simply **(2)** a **single row** with an `ID` of `1`, an `X` value of `X1`, and a `pop` value of `pop01`. — Greg, Jul 07 '21 at 20:01
@Greg you are right about the extra column. There are a lot of extra columns and I've been subsetting the dataframe to get just the columns I need...not always with complete success.... — Michael Smith, Jul 08 '21 at 04:54

aosmith · Accepted Answer · 2021-07-07T19:57:45.913

Since you want two new "long" columns based on the names of the two different types of measurement you'll want .value in the names_to.

Then the trickiest thing (for me) is defining the names_pattern to tell R how to create the new column names. Here the column names are based on the string starting with either X or pop and the trailing numbers are put in the year column. I convert those to integers in names_transform to bypass problems with how the numbers are in the column names (e.g., X1 vs pop01).

I reduced your example a bit so it'd be easier to see the results but it will work the same with more columns.

The total number of rows is then number of ID's*number of "years".

library(tidyr)

set.seed(16)
df1 <- data.frame(ID = seq(1, 4),
                  X1 = runif(4, 0, 1),
                  X2 = runif(4, 0, 1),
                  X3 = runif(4, 0, 1),
                  X4 = runif(4, 0, 1),
                  X5 = runif(4, 0, 1),
                  X6 = runif(4, 0, 1),
                  X7 = runif(4, 0, 1),
                  X8 = runif(4, 0, 1),
                  X9 = runif(4, 0, 1),
                  X10 = runif(4, 0, 1),
                  pop01 = runif(4, 0, 100),
                  pop02 = runif(4, 0, 100),
                  pop03 = runif(4, 0, 100),
                  pop04 = runif(4, 0, 100),
                  pop05 = runif(4, 0, 100),
                  pop06 = runif(4, 0, 100),
                  pop07 = runif(4, 0, 100),
                  pop08 = runif(4, 0, 100),
                  pop09 = runif(4, 0, 100),
                  pop10 = runif(4, 0, 100))


pivot_longer(df1, cols = -ID, 
             names_to = c(".value", "year"),
             names_pattern = "(X|pop)(.*)",
             names_transform = list(year = as.integer))
#> # A tibble: 40 x 4
#>       ID  year     X   pop
#>    <int> <int> <dbl> <dbl>
#>  1     1     1 0.683  4.97
#>  2     1     2 0.864 42.8 
#>  3     1     3 0.874 19.3 
#>  4     1     4 0.157 85.4 
#>  5     1     5 0.847 80.0 
#>  6     1     6 0.968 25.1 
#>  7     1     7 0.228 47.2 
#>  8     1     8 0.765 15.3 
#>  9     1     9 0.718 70.7 
#> 10     1    10 0.294 61.1 
#> # ... with 30 more rows

^{Created on 2021-07-07 by the reprex package (v2.0.0)}

If you don't want the "year" column you can remove via dplyr::select(). You can rename your two new columns via dplyr::rename(). Alternatively you could change the X to something more meaningful prior to converting to long. For example, use:

names(df1) <- sub("X", "def", names(df1))

Good old Tidyverse, they've come up with something even more baffling than Gather! According to the documentation, ".value indicates that component of the name defines the name of the column containing the cell values, overriding values_to." Someone should translate that into English.... — Michael Smith, Jul 08 '21 at 05:33
@MichaelSmith You could definitely `gather()` and then `spread()` instead! This is just a "one step" option. I agree that the definition of `.value` could probably use some more explanation; if you have any good ideas on how to make things clearer you might make suggestions or a pull request on the [GitHub repo](https://github.com/tidyverse/tidyr). — aosmith, Jul 08 '21 at 13:23

tidyr::pivot_longer to merge 2 groups of columns, each group with a different name prefix

1 Answers1