1

A matrix cov_mat stores the covariances between variables:

        a_plane a_boat  a_train b_plane b_boat  b_train c_plane c_boat  c_train d_plane …
a_plane   4.419 -0.583    0.446  0.018  -1.291    3.159  -0.954  0.488    3.111   1.100 
a_boat   -0.583  2.636    1.813 -1.511  -0.420   -0.757   1.698  1.668    1.091   0.120 
a_train   0.446  1.813    2.668 -0.365  -0.183    1.040   1.347  1.813    0.806  -0.324 
b_plane   0.018 -1.511   -0.365  2.498   1.153    1.498  -0.465 -1.157   -0.775   0.133 
b_boat   -1.291 -0.420   -0.183  1.153   1.043   -0.194   0.243 -0.361   -0.981  -0.040 
b_train   3.159 -0.757    1.040  1.498  -0.194    4.153  -0.208  0.257    1.922   1.434 
c_plane  -0.954  1.698    1.347 -0.465   0.243   -0.208   1.791  0.909    0.259   0.394 
c_boat    0.488  1.668    1.813 -1.157  -0.361    0.257   0.909  2.290    1.572   0.269 
c_train   3.111  1.091    0.806 -0.775  -0.981    1.922   0.259  1.572    4.097   2.001 
d_plane   1.100  0.120   -0.324  0.133  -0.040    1.434   0.394  0.269    2.001   2.231 
…

final_need is a data frame with a row for each category of transportation (plane, boat, train) and a column for every possible covariance within a given category:

           aa       ab      ac     ad       ba     bb       bc     bd       ca     cb   …   <dd>
plane   4.419    0.018  -0.954  1.100    0.018  2.498   -0.465  0.133   -0.954  -0.465  …   
boat    2.636   -0.420   1.668  0.120   -0.420  1.043   -0.361  …               
train   …                                           
<…> 

To get from cov_mat to final_need, I've converted the file to an edgelist via igraph, then eliminated rows of that edgelist that included out-of-category covariance calculations (e.g.,a_planecovaries witha_boat`, but I could care less). Here's the result:

> head(cov_edgelist_slim)

   from      to covariance
a_plane a_plane      4.419
a_plane b_plane      0.018
a_plane c_plane     -0.954
a_plane d_plane      1.100
b_plane a_plane          …
…       …                …

I then try to use dcast() from reshape2, but am getting stuck on how to use the function to produce the final_need result. Any thoughts? If there's a simpler way than the one I'm heading down, I'm happy to hear it!

J.Q
  • 971
  • 1
  • 14
  • 29

2 Answers2

2

Here is another approach using base R:

  1. Extract the sub-matrix of covariances for each individual vehicle;
  2. Expand this sub-matrix to a vector;
  3. Combine the (row) vectors back into a matrix.

The additional code is to get the right column names for final_need based on pasting together the combinations of the prefixes.

## sort row + colnames in alphabetical order
cov_mat <- cov_mat[sort(rownames(cov_mat)), sort(colnames(cov_mat))]

## unique prefix and vehicle names
prefix <- unique(sub("_\\w+$", "", colnames(cov_mat)))
vehicNames <- unique(sub("^\\w+?_", "", colnames(cov_mat)))

## create final_need
final_need <- t(sapply(vehicNames, function(vehic) {           
          indices <- grep(vehic, colnames(cov_mat))      
          as.vector(cov_mat[indices, indices])
        }))

## add prefix combinations as column names
colnames(final_need) <- levels(interaction(prefix, prefix, sep = ""))

final_need
#>          aa     ba     ca     ab    bb     cb     ac     bc    cc
#> boat  2.636 -0.420  1.668 -0.420 1.043 -0.361  1.668 -0.361 2.290
#> plane 4.419  0.018 -0.954  0.018 2.498 -0.465 -0.954 -0.465 1.791
#> train 2.668  1.040  0.806  1.040 4.153  1.922  0.806  1.922 4.097

EDIT: the same can be done the other way around, i.e. extract the sub-matrix of covariances for each prefix combination and combine their diagonals back into a matrix (by column).

## create final_need by column
final_need <- apply(expand.grid(prefix, prefix), 1, function(i) {
      row_ids <- grep(sprintf("^%s_", i[1]), rownames(cov_mat))
      col_ids <- grep(sprintf("^%s_", i[2]), colnames(cov_mat))
      cov_mat[cbind(row_ids, col_ids)]
    })

## add row + column names
dimnames(final_need) <- list(vehicNames, levels(interaction(prefix, prefix, sep = "")))

final_need
#>          aa     ba     ca     ab    bb     cb     ac     bc    cc
#> boat  2.636 -0.420  1.668 -0.420 1.043 -0.361  1.668 -0.361 2.290
#> plane 4.419  0.018 -0.954  0.018 2.498 -0.465 -0.954 -0.465 1.791
#> train 2.668  1.040  0.806  1.040 4.153  1.922  0.806  1.922 4.097
Joris C.
  • 5,721
  • 3
  • 12
  • 27
  • This works on the shorter toy example I provide - awesome! But sapply is simplifying my `final_need` into a matrix composed of one list for each of my 58 vehicle types. Calling `final_need[[1]]` provides the vector of covariances for boat, `final_need[[2]]` for plane, etc. I can work with this, but any thoughts on coercing `sapply()` to behave as it did for you? – J.Q Sep 28 '19 at 13:45
  • Here's the complete data. Prefixes are "e_/p_/a_/g_/pres_". Suffixes are all "_1" and unnecessary. https://docs.google.com/spreadsheets/d/1r4UXdE3sIlvbkRZy8aMaWUwRCU00rFjZ7YY-tQ6FR4g/edit?usp=sharing – J.Q Sep 28 '19 at 14:02
  • 1
    I checked your complete data, but here the covariances between different prefixes do not all exist, e.g. `welder_1` is only present with prefix `pres` but not with any other prefix. For this reason `sapply` cannot simplify the data into a matrix and instead returns a list of vectors (with uncommon lengths). You may have to rethink the desired outputs.. – Joris C. Sep 29 '19 at 09:05
1

A tidyverse approach could be to bring rownames as column, convert data to long format. Get the first part of string from both column names as well rownames, keep only those rows where both of them match and convert to wide format.

library(tidyverse)

df %>%
  rownames_to_column() %>%
  pivot_longer(cols =-rowname) %>%
  mutate(key = paste0(sub("_.*", "", rowname), sub("_.*", "", name)), 
         rowname = sub(".*_", "", rowname), name = sub(".*_", "", name)) %>%
  filter(rowname == name) %>%
  select(-rowname) %>%
  pivot_wider(names_from = key, values_from = value)

# A tibble: 3 x 17
#  name    aa     ab     ac    ad     ba    bb     bc     bd     ca     cb ....
#  <chr> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#1 plane  4.42  0.018 -0.954   1.1  0.018  2.50 -0.465  0.133 -0.954 -0.465
#2 boat   2.64 -0.42   1.67   NA   -0.42   1.04 -0.361 NA      1.67  -0.361
#3 train  2.67  1.04   0.806  NA    1.04   4.15  1.92  NA      0.806  1.92 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    almost there - I get an error: `Error in pivot_longer(., cols = -rowname) : could not find function "pivot_longer"`. I have tidyverse running and just ran `update.packages()` for `tidyr` and `tidyverse`. – J.Q Sep 28 '19 at 13:28
  • 1
    @JMQ Can you run `install.packages("tidyr")` ? If you run `packageVersion("tidyr")` it should return `[1] ‘1.0.0’`. – Ronak Shah Sep 28 '19 at 13:32
  • something went wrong so I had to remove and reinstall tidyr, but the function is working now. Now I'm just trying to map your code to my full ex, as I'm getting warnings and lists in each cell. Here's the complete data. Prefixes are "e_/p_/a_/g_/pres_". Suffixes are all "_1" and unnecessary. https://docs.google.com/spreadsheets/d/1r4UXdE3sIlvbkRZy8aMaWUwRCU00rFjZ7YY-tQ6FR4g/edit?usp=sharing – J.Q Sep 28 '19 at 13:59