
Suppose I have a dataframe like this:

ID sp1 sp2 sp3
1  NA   1   1
2  0    0   1
3  1    NA  0
4  1    1   1

Here is what I wanted to get:

ID 1 2 3 4
1  2 1 0 2
2  1 1 0 1
3  0 0 1 1
4  2 1 1 3

which shows, for each pair of IDs, the number of columns in which both rows have the value 1.

As the original dataframe is quite large, I hope to find an efficient way to do this.

Thank you very much for any efforts.

YannZ

1 Answer


To create a co-occurrence matrix from your data, first convert the NAs to 0s, then take the cross-product of the transposed data without the first (ID) column. Because crossprod(t(m)) computes m %*% t(m), each entry counts the columns in which both rows are 1:

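# Example data from the question
x = data.frame(ID = c(1:4), sp1 = c(NA,0,1,1), sp2 = c(1,0,NA,1), sp3 = c(1,1,0,1))
# Treat missing values as absences
x[is.na(x)] = 0
# Row-by-row co-occurrence counts, ID column dropped
crossprod(t(x[-1]))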

     [,1] [,2] [,3] [,4]
[1,]    2    1    0    2
[2,]    1    1    0    1
[3,]    0    0    1    1
[4,]    2    1    1    3
Lamia
  • Thanks @Lamia, I think this is what I need. But maybe my data were too large: it failed with the error 'cannot allocate vector of size 500Gb'. Thank you all the same, and I'll try to subset the datasets. – YannZ Apr 17 '20 at 18:21
  • Have a look at crossprod of sparse matrices in the `Matrix` package. It should be more memory efficient. – Lamia Apr 17 '20 at 18:37
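
The sparse-matrix suggestion above could look roughly like the sketch below. It reuses the toy data from the question and assumes the `Matrix` package is installed. The names `m`, `sm` and `co` are just illustrative. Whether this fits in memory on the real data depends on how many ID pairs actually share a 1, since the result still has one row and one column per ID.

```r
library(Matrix)

# Toy data from the question
x <- data.frame(ID  = 1:4,
                sp1 = c(NA, 0, 1, 1),
                sp2 = c(1, 0, NA, 1),
                sp3 = c(1, 1, 0, 1))

# Drop the ID column, treat NAs as absences, keep the IDs as row names
m <- as.matrix(x[-1])
m[is.na(m)] <- 0
rownames(m) <- as.character(x$ID)

# Sparse storage keeps only the nonzero entries
sm <- Matrix(m, sparse = TRUE)

# tcrossprod(sm) computes sm %*% t(sm), i.e. the same
# row-by-row counts as crossprod(t(m)) in the dense version
co <- tcrossprod(sm)
co
```

On real data it is probably better to build the sparse matrix directly (for example with `sparseMatrix()` from row/column indices of the 1s) rather than going through a dense matrix first.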