Automatically compute and store a lot of variables in a dataframe

Question

I have a list called res.Sigma containing 100 covariance matrices of order 4*4, where each matrix corresponds to a fixed time. The values of the time variable are stored in a variable called X.

Now, we can compute the time-varying sample variances and store them in a dataframe as follows:

sample.var <- data.frame(X = X,
                         v11 = res.Sigma %>% lapply(function(val) val[1,1]) %>% unlist(),
                         v22 = res.Sigma %>% lapply(function(val) val[2,2]) %>% unlist(),
                         v33 = res.Sigma %>% lapply(function(val) val[3,3]) %>% unlist(),
                         v44 = res.Sigma %>% lapply(function(val) val[4,4]) %>% unlist())

Similarly, the time-varying sample correlations can be computed and stored as follows:

sample.corr <- data.frame(X = X,
                          r12 = res.Sigma %>% lapply(function(val) 
                                      val[1,2]/sqrt(val[1,1]*val[2,2])) %>% unlist(),
                          r13 = res.Sigma %>% lapply(function(val) 
                                      val[1,3]/sqrt(val[1,1]*val[3,3])) %>% unlist(),
                          r14 = res.Sigma %>% lapply(function(val) 
                                      val[1,4]/sqrt(val[1,1]*val[4,4])) %>% unlist(),
                          r23 = res.Sigma %>% lapply(function(val) 
                                      val[2,3]/sqrt(val[2,2]*val[3,3])) %>% unlist(),
                          .
                          .
                          .
                          )

There will be 4C2 = 6 columns for sample correlations between each pair of variables. However, if there were 20 variables instead of 4, there would be 20C2 = 190 such columns which will be impossible to write manually as I have written above.

I'm looking for a way to do this automatically, and name the columns appropriately (r_{i,j} for the column containing correlations between the i-th and the j-th variables).

Any help will be appreciated. Let me know if my question is unclear.

Could you please make the question reproducible, perhaps with some small fake data? — Jon Spring, Jul 14 '23 at 05:32

Mark · Answer 1 · 2023-07-14T07:14:16.127

Here is another way (I prefer Jan's answer though):

library(tidyverse)

# Create a toy dataset
df <- tibble(
  a = rnorm(100), # initialise the tibble with one column
)
for (i in letters[2:20]) {
  df[[i]] <- rnorm(100) # add the letter columns b to t
}

combinations <- combn(names(df), 2, simplify = FALSE) # find all ways of choosing 2 from 20 columns
comb_names <- map_chr(combinations, ~ paste0("r_{", paste(.x, collapse = ","), "}")) # create the final names of those columns

# combine it all together
comb_df <- setNames(map(combinations, ~ df %>%
             select(all_of(.x)) %>%
             rowSums(na.rm = TRUE)), comb_names) %>% # I'm finding the rowSums in this example, covariance matrices would be similar
             as_tibble()

# A tibble: 100 × 190
   `r_{a,b}` `r_{a,c}` `r_{a,d}` `r_{a,e}` `r_{a,f}` `r_{a,g}` `r_{a,h}`
       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1     2.27     -0.715     0.626     0.847     1.90    -1.35       0.651
 2     2.97      1.76      0.581     1.79     -0.202    1.63       1.70 
 3     0.415     0.965     0.338    -0.154     0.222   -0.0943    -0.221
 4     0.995     1.01      1.36      0.962     1.39    -0.0976     1.16 
 5     1.54      0.373     1.56      0.143     1.80     1.40      -1.05 
 6     1.72      2.96      1.30      1.23      0.871    1.20       2.22 
 7     0.919     0.874     0.384     1.84      3.39     1.17       3.08 
 8    -0.384    -0.871     1.22      0.726     0.668   -0.152      0.295
 9     0.185    -1.18     -0.177     0.454     0.638   -0.538     -0.304
10    -1.33     -0.639    -0.539    -2.34     -1.28    -1.56      -0.691
# ℹ 90 more rows
# ℹ 183 more variables: `r_{a,i}` <dbl>, `r_{a,j}` <dbl>, `r_{a,k}` <dbl>,
#   `r_{a,l}` <dbl>, `r_{a,m}` <dbl>, `r_{a,n}` <dbl>, `r_{a,o}` <dbl>,
#   `r_{a,p}` <dbl>, `r_{a,q}` <dbl>, `r_{a,r}` <dbl>, `r_{a,s}` <dbl>,
#   `r_{a,t}` <dbl>, `r_{b,c}` <dbl>, `r_{b,d}` <dbl>, `r_{b,e}` <dbl>,
#   `r_{b,f}` <dbl>, `r_{b,g}` <dbl>, `r_{b,h}` <dbl>, `r_{b,i}` <dbl>,
#   `r_{b,j}` <dbl>, `r_{b,k}` <dbl>, `r_{b,l}` <dbl>, `r_{b,m}` <dbl>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Thanks a lot for your answer. It really helped. – JRC Jul 29 '23 at 09:11 — JRC, Jul 29 '23 at 09:11

Jan · Accepted Answer · 2023-07-14T07:30:37.773

Here is a tidyverse example where the columns are generated dynamically. First, we define two functions which will generate the columns and then we apply them to the data frame.

library(tidyverse)

# create the mutate functions
create_var_col <- function(df, n, mat) {
    mutate(df, "v{n}{n}" := map(mat, function(x)
        x[n, n]))
}

create_corr_col <- function(df, j, k, mat) {
    mutate(df, "r{j}{k}" := map(mat, function(x)
        x[j, k] / (x[j, j] * x[k, k])))
}

Application:

# initalize res.Sigma and X
nrow <- 4

res.Sigma <- lapply(seq_len(100), function(x)
    matrix(runif(16), nrow = nrow))

X <- 1:100

# initialize output data frames
sample.var <- data.frame(
    X = X
)

sample.corr <- data.frame(
    X = X
)

# populate the data frames with new columns
for (n in 1:nrow) {
    sample.var <- create_var_col(sample.var, n, res.Sigma)
}

combs <- combn(nrow, 2)

for (i in (seq_len(ncol(combs)))) {

    j <- combs[1, i]
    k <- combs[2, i]
    
    sample.corr <- create_corr_col(sample.corr, j, k, res.Sigma)
}

Results:

head(sample.var)

> head(sample.var)
  X       v11       v22          v33        v44
1 1 0.3770457 0.1360691 0.8865960506 0.22080969
2 2 0.7450246 0.6379828 0.0007324344 0.15418729
3 3 0.4506430 0.5700204 0.7727197134 0.76241635
4 4 0.3577171 0.7811212 0.2732931704 0.12555681
5 5 0.9532550 0.9781340 0.2275580675 0.03497771
6 6 0.0226325 0.3912557 0.4199170584 0.26492276

> head(sample.corr)
  X         r12        r13        r14         r23        r24          r34
1 1  17.6002993   1.923644  0.8184414   2.3721132 21.9337539    0.6569090
2 2   0.6498641 817.358339  4.9511748 299.1091543  0.6651190 4276.9960394
3 3   0.6825716   2.117533  0.8900186   1.9256916  1.9253819    0.6588873
4 4   0.8280786   8.324110  1.6634117   1.7292685  2.0207334    9.9122315
5 5   0.9520840   3.642721  3.5012283   2.2722650  0.2582217  116.3901019
6 6 100.8714017  99.157062 97.8675581   0.5384571  5.6053657    2.5930488

a few ideas: 1. the column syntax appears to be slightly off - I think OP was looking for r_{i,j}, so r{i}{j} is slightly different. 2. if you wanted to be very tidyverse, you could use map instead of lapply — Mark, Jul 14 '23 at 07:12
Thank you for your feedback, I changed `lapply` to `map`. Regarding the column syntax I sticked to the one provided in the question of OP ... — Jan, Jul 14 '23 at 07:17
true! I didn't see that the OP had done that...I suppose OP will just have to make sure they never use more than 10 variables, otherwise things might start getting confusing — Mark, Jul 14 '23 at 07:22
I think with the map, you can remove the `unlist()`s and it still works the same — Mark, Jul 14 '23 at 07:24

Automatically compute and store a lot of variables in a dataframe

2 Answers2

Linked