
I have a list of files that contain specific genes, and I want to create a binary relation matrix in R that shows the presence of each gene in each file.

For example, here are my files aaa, bbb, ccc, and ddd and the genes associated with them.

aaa=c("HERC1")
bbb=c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2")
ccc=c("HERC1")
ddd=c("MACC1","PKHD1L1")

I would like to know which command I could use in R to generate a binary relation table like the one in the following image:

[image: expected binary relation table]

where the value 1 means association, and the value 0 means non-association.

How can I do this operation in R?

I tried to use table(aaa,bbb,ccc,ddd) but it did not work. R said:

Error in table(aaa, bbb, ccc, ddd) : all arguments must have the same length

EDIT: Thanks @akrun for your useful reply! I'll take advantage of this question to ask for help with another issue, which I'm sure you can handle very quickly. For the second part of my analysis, I need to generate another table where, for each pair of genes, I assign the value 1 if both of them are present in the specific file, and 0 otherwise. Following the example I gave earlier, this new table should look like the following one (I transposed it for clarity):

[image: expected bigenic table, transposed]

Does anybody know a quick way to obtain this new bigenic table in R, starting from the commands you guys already provided to me? Thanks!

DavideChicco.it

1 Answer


One option is to collect the objects into a named list by their identifiers (mget), stack that list into a two-column data.frame, and get the frequencies with table:

table(stack( mget(strrep(letters[1:4], 3)))[2:1])
#   values
#ind   HERC1 MACC1 MYO9A PKHD1L1 PQLC2 SLC7A2
#  aaa     1     0     0       0     0      0
#  bbb     0     0     1       1     1      1
#  ccc     1     0     0       0     0      0
#  ddd     0     1     0       1     0      0
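
If the gene vectors are already held in a named list (for example, one element per file), the same idea applies without mget. The following is a minimal sketch under that assumption; the name gene_lists is hypothetical, and the last step only matters if a gene can be listed more than once in a file, since table returns counts rather than 0/1 values:

```r
# Hypothetical named list; in practice it might be built while reading the files
gene_lists <- list(
  aaa = c("HERC1"),
  bbb = c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2"),
  ccc = c("HERC1"),
  ddd = c("MACC1", "PKHD1L1")
)

# stack() turns the named list into a data.frame with columns `values` and `ind`
long <- stack(gene_lists)

# Cross-tabulate file (ind) against gene (values)
counts <- table(long$ind, long$values, dnn = c("file", "gene"))

# Convert counts to presence/absence so that duplicated genes still give 1
presence <- (counts > 0) * 1L
presence
```

With hundreds of files, building the list once and passing it to stack avoids having to spell out every object name.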

Or an option with tidyverse

library(tidyverse)
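lst(aaa, bbb, ccc, ddd) %>%    # named list; names are taken from the object names
  enframe() %>%                # two columns: name and value (a list-column)
  unnest(value) %>%            # one row per (file, gene) pair
  count(name, value) %>%       # frequency of each pair
  spread(value, n, fill = 0)   # wide format, with 0 where a gene is absent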
# A tibble: 4 x 7
#  name  HERC1 MACC1 MYO9A PKHD1L1 PQLC2 SLC7A2
#  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>
#1 aaa       1     0     0       0     0      0
#2 bbb       0     0     1       1     1      1
#3 ccc       1     0     0       0     0      0
#4 ddd       0     1     0       1     0      0

In the OP's code

table(aaa,bbb,ccc,ddd)

the vectors need to have the same length for table to work. In addition, if we pass more than two vectors, the frequency table becomes multi-dimensional (> 2D). So we need a way to apply table to two columns (file and gene) instead of to multiple separate objects.
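
The same two-dimensional result can also be built without table, by testing each gene for membership in each file with %in%. This is only a sketch of an alternative, not part of the approach above; the names gene_lists, all_genes and m are illustrative:

```r
# Hypothetical named list holding the example data
gene_lists <- list(
  aaa = c("HERC1"),
  bbb = c("MYO9A", "PKHD1L1", "PQLC2", "SLC7A2"),
  ccc = c("HERC1"),
  ddd = c("MACC1", "PKHD1L1")
)

# Every gene that occurs in at least one file
all_genes <- sort(unique(unlist(gene_lists)))

# For each file, flag which genes are present (1) or absent (0).
# sapply() returns a genes x files matrix, so transpose to files x genes.
m <- t(sapply(gene_lists, function(g) as.integer(all_genes %in% g)))
colnames(m) <- all_genes
m
```

Because %in% only tests membership, the result is 0/1 even when a gene is repeated within a file.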

akrun
  • thanks, very fast! Regarding the `table(stack` solution, how do you specify the variables `aaa, bbb, ccc, ddd`? My request is actually a prototype; the real task has hundreds of variables with more complicated names. – DavideChicco.it May 15 '19 at 18:44
  • @DavideChicco.it I created the string identifiers with `strrep` and `letters`, e.g. `strrep("a", 3)` gives `"aaa"`; alternatively, it can also be `mget(c("aaa", "bbb", .))` – akrun May 15 '19 at 18:46
  • @DavideChicco.it If there is no pattern in the names and the session has only those objects, then use `mget(ls())`. Having said that, you would have the identifiers while creating them, right? – akrun May 15 '19 at 18:47
  • Off topic, but just from reading the code, I already knew who wrote it :) – utubun May 15 '19 at 20:39
  • @akrun Thanks for your help; can you please help me with the new part of the question too? – DavideChicco.it May 17 '19 at 20:37
  • @DavideChicco.it You can post it as a new question – akrun May 17 '19 at 20:38
  • @akrun Okay, new question here: https://stackoverflow.com/questions/56193505/r-how-to-create-a-binary-relation-matrix-of-pair-occurrences-from-a-list-of-str – DavideChicco.it May 17 '19 at 20:59