How to scale up a transmute in tidyverse?

Question

I have a tibble with many variables organised this way:

tibble(
  A = rep("A",10),
  xyz1 = rnorm(10),
  xyz2 = rnorm(10),
  xyz3 = rnorm(10),
  abc1 = rnorm(10),
  abc2 = rnorm(10),
  abb3 = rnorm(10),
  acc4 = rnorm(10)
)

where xyz, abc, etc. are placeholder. After the placeholder there is a number. Assume it can be any integer. In my tibble, that number can be any integer.

I want to trasmute it according to the formula

xyzn = xyzn - 'xyzn-1', where n is the symbol for the counted integer.

Whereas 'xyzn-1' does not exist, the result can be ignored and not join the transmute.

Schematic output:

tibble(
  A = A
  xyz2 = xyz2 - xyz1,
  xyz3 = xyz3 - xyz2,
  abc2 = abc2 - abc1
)

If xyz2 = 4 and xyz1 = 3, then after the transmute xyz2 = 1 and and xyz1 = NA — GiulioGCantone, Jan 06 '23 at 18:26
Whereas 'xyzn-1' does not exist, the result can be ignored and not join the transmute. I mean, if the result is a NA column is not a issue, since it can be removed later. — GiulioGCantone, Jan 06 '23 at 18:30
It is an acceptable output, but I will removed it later. Notice that there will be no ’acc’ column since the numerical columns all contains an integer, so ’acc4’. Also the first column should be preserved in the transmute. — GiulioGCantone, Jan 06 '23 at 18:34
Many less columns will be left, because the token before the n integer counts. — GiulioGCantone, Jan 06 '23 at 18:39

akrun · Answer 1 · 2023-01-06T19:31:40.013

Perhaps this helps

library(dplyr)
library(tidyr)
df1 %>% 
  mutate(rn = row_number()) %>% 
  pivot_longer(cols = -c(A, rn), names_to = c(".value", "ind"), 
   names_pattern = "(\\D+)(\\d+)",
    names_transform = list(ind = as.integer)) %>% 
  arrange(A, rn, ind) %>% 
  group_by(A, rn) %>%
  mutate(across(-ind, ~ c(NA, diff(.x)))) %>%
  ungroup %>% 
 pivot_wider(names_from = ind, values_from = xyz:acc,
   names_sep = "") %>% 
 select(-rn) %>% 
 select(where(~ any(complete.cases(.x))))

-output

# A tibble: 10 × 4
   A       xyz2   xyz3    abc2
   <chr>  <dbl>  <dbl>   <dbl>
 1 A     -1.60   1.75  -1.53  
 2 A      2.89  -3.81   0.0701
 3 A     -0.657  0.920 -0.912 
 4 A      0.305  0.395 -0.477 
 5 A     -0.289  1.39   1.38  
 6 A     -0.103  0.426 -1.38  
 7 A     -2.16   1.44  -0.913 
 8 A     -0.260 -0.249 -1.45  
 9 A     -1.15   1.14  -1.42  
10 A     -0.306  0.198 -0.118

Or may be

cbind(df1[1],  do.call(cbind, unname(Filter(nrow, lapply(split.default(df1[-1], 
sub("\\d+$", "", names(df1)[-1])), 
     \(x) {
    i1 <- order(as.integer(sub("\\D+", "", names(x))))
    x <- x[i1]
    x[-1]- x[-ncol(x)]
   })))
))

-output

  A        abc2       xyz2       xyz3
1  A -1.52667071 -1.5985160  1.7533450
2  A  0.07013841  2.8939503 -3.8113492
3  A -0.91213998 -0.6573093  0.9197824
4  A -0.47712113  0.3049918  0.3945995
5  A  1.37871603 -0.2886773  1.3933839
6  A -1.37608993 -0.1031296  0.4264927
7  A -0.91313982 -2.1630265  1.4407289
8  A -1.45439105 -0.2598476 -0.2493127
9  A -1.41590040 -1.1490018  1.1383060
10 A -0.11775196 -0.3061306  0.1984115

A note: while columns can be reordered to an alphabetical criterion, do not assume that they are pre-ordered. So it is not just a "subtract the left column from the right column". Anyway I am sure that I can sort out a code to alphabetically order the numeric columns. — GiulioGCantone, Jan 06 '23 at 18:44
The code in the second block does something different, for example, in the case of xyz5, xyz4, xyz2, xyz1, it handles xyz5-xyz4 correctly, but xyz2-xyz1 uncorrectly; since instead it assigns the value of xyz4-xyz2, which is not required since xyz3 does not exist. Also, it seems to mess with the names of the trasnmute columns. — GiulioGCantone, Jan 06 '23 at 18:54
Updated with schematic output. There is no xyz4 or xyz5 because it is a scheme. Tokens can be any tokens, integers any integers. — GiulioGCantone, Jan 06 '23 at 19:13

score 1 · Answer 2 · answered Jan 07 '23 at 11:13

Usually the {dplyover} package can help with this kind of problems (disclaimer: I'm the maintainer). However, in your specific case the problem is not that easy to solve due to he specific conditions for variable selection.

In the approach below we first construct the variable names that we want to subtract from each other myvars1 and myvars2.

After that we can use dplyover::across2() together with all_of().

See the code comments for what we do in each step:

library(dplyr)
library(stringr)
library(dplyover) # https://timteafan.github.io/dplyover/


# get all variable stems
all_stems <- dplyover::cut_names("[0-9]$", .vars = names(df1))

# exlcude stems which don't start with 1
use_stems <- all_stems[paste0(all_stems, 1) %in% names(df1)]

# construct regex pattern to select all vars with > 1
patrn <- paste0("(", paste(use_stems, collapse = "|"), ")[^1]$")

# select vars with > 1
myvars1 <- grep(patrn, names(df1), value = TRUE)
# select vars to substract from `myvars1`
myvars2 <- str_replace(myvars1, "\\d$", ~ as.numeric(.x) - 1)

# use `dplyover::across2()` with `all_of()`
df1 %>% 
  transmute(
    A = A, # dplyover doesn't support the `.keep` argument so we need a workaround
    across2(all_of(myvars1),
            all_of(myvars2),
            ~ .x - .y,
            .names = "{xcol}")
    )
#> # A tibble: 10 × 4
#>    A       xyz2    xyz3   abc2
#>    <chr>  <dbl>   <dbl>  <dbl>
#>  1 A      0.847 -1.19    0.413
#>  2 A      1.00   0.946  -3.26 
#>  3 A      0.856 -1.11   -2.62 
#>  4 A     -0.325  1.47    1.11 
#>  5 A     -1.18   0.0830  2.78 
#>  6 A     -2.65  -0.520  -0.337
#>  7 A      0.197 -0.447   0.347
#>  8 A     -0.484  1.18   -0.717
#>  9 A     -1.94   1.81    1.05 
#> 10 A     -3.80   1.36    0.202

The from OP:

df1 <- tibble(
  A = rep("A",10),
  xyz1 = rnorm(10), #   2 
  xyz2 = rnorm(10), # 1 2
  xyz3 = rnorm(10), # 1
  abc1 = rnorm(10), #   2
  abc2 = rnorm(10), # 1
  abb3 = rnorm(10), #
  acc4 = rnorm(10)  #
)

^{Created on 2023-01-07 with reprex v2.0.2}

`dplyover` seems a very nice package. I found an alternative solution with cycles of `pivot_widers`, inspired by @akrun, which works exactly as it should; however, I also found a workaround from the data source that re-patterns columns as `xyz_n` instead of `xyzn`. I prefer my solution because it is a unique pipeline, which is easier to debug. — GiulioGCantone, Jan 07 '23 at 12:58
@GiulioGCantone maybe you can post your approach as a separate answer, would be great to have it here with the other solutions! — TimTeaFan, Jan 07 '23 at 14:20

How to scale up a transmute in tidyverse?

2 Answers2