
I have a huge matrix with nrow=144 and ncol=156267 containing numbers, and I would like to compute the correlation between all pairs of columns. This can be done using the bigcor function described here: https://www.r-bloggers.com/bigcor-large-correlation-matrices-in-r/.

After defining the bigcor function, I ran:

bigcor(Mbig2, nblocks = 1611, verbose = TRUE)

This produces the following error:

Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(vmode = "single", dim = c(NCOL, NCOL)) :

My questions are: 1) Is this even feasible? 2) Is there a way around the error?
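To see why the error occurs, note that bigcor tries to allocate an ff matrix with one cell per pair of columns. A quick back-of-the-envelope check (a sketch, using only the dimensions given above) shows the element count exceeds R's 32-bit integer indexing limit, which is what trips the length check inside ff:

```r
# The full correlation matrix would be 156267 x 156267.
n_cols     <- 156267
n_elements <- n_cols^2              # about 2.44e10 elements
int_max    <- .Machine$integer.max  # 2147483647 (2^31 - 1)
n_elements > int_max                # TRUE: ff() cannot index this many cells

# Even at 4 bytes per vmode = "single" value, the storage needed is huge:
gib <- n_elements * 4 / 1024^3      # roughly 91 GiB
```

So the failure is not a bug in your call; the requested matrix is simply past what ff can index (and would be enormous on disk even if it could).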

    That means a matrix of 156,267 rows and 156,267 columns. (It's symmetric, so you can divide roughly by two.) Rather large, agreed? – duffymo Aug 30 '16 at 20:56
  • Do you mean dividing by the diagonal? I thought that was a given. Yes, it is rather large. – NKGon Aug 30 '16 at 21:26
  • No, I do not mean dividing by the diagonal. Correlation matrices have 1.0 on the diagonal, by definition, and off-diagonal terms that range from -1 to 1. I don't know what you're talking about. – duffymo Aug 31 '16 at 00:14
  • There is mention of this error at http://www.bytemining.com/2010/05/hitting-the-big-data-ceiling-in-r/ and http://brainchronicle.blogspot.ca/2013/02/large-correlation-in-parallel.html - it seems to be a limitation of R. Is there a reason for calculating such a large correlation matrix? Perhaps there is a way around this if could let us know what you will be using this correlation matrix for. – jav Aug 31 '16 at 03:29
  • Thanks @jav, it seems to be a limitation of R. – NKGon Aug 31 '16 at 14:57
    @jav, The idea behind this matrix was: each column represents a block of genomic DNA that might be linked to other blocks depending on the correlation result. I wanted to first compute all the possible correlation values and then explore them to analyze which blocks are most linked. Now I think this naive approach is not the correct one. I'll try to separate some selected blocks (columns) first and then compute the correlation. The downside is that this is no longer naive, as blocks are selected first. – NKGon Aug 31 '16 at 15:05
    Do you know a rough cutoff for minimum correlation value? e.g. 0.05 or something? It might seriously reduce the size of the result. – smci Aug 06 '18 at 23:25
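Following the cutoff idea in the comments: the input itself is small (144 × 156267 doubles is only about 180 MB), so the data fits in memory even though the full result does not. A minimal sketch of a blockwise approach that never materializes the whole correlation matrix and keeps only pairs above a threshold (block_size and the 0.05 cutoff here are illustrative, not prescriptive):

```r
# Sketch: compute cor() block by block and retain only strong pairs.
strong_cors <- function(M, block_size = 2000, cutoff = 0.05) {
  nb  <- ceiling(ncol(M) / block_size)
  idx <- split(seq_len(ncol(M)),
               rep(seq_len(nb), each = block_size, length.out = ncol(M)))
  out <- list()
  for (i in seq_len(nb)) {
    for (j in i:nb) {  # upper triangle only: the matrix is symmetric
      cc   <- cor(M[, idx[[i]], drop = FALSE], M[, idx[[j]], drop = FALSE])
      keep <- which(abs(cc) >= cutoff, arr.ind = TRUE)
      if (i == j)      # within a block, drop the diagonal and duplicate pairs
        keep <- keep[keep[, 1] < keep[, 2], , drop = FALSE]
      if (nrow(keep) > 0) {
        out[[length(out) + 1]] <- data.frame(
          col1 = idx[[i]][keep[, 1]],
          col2 = idx[[j]][keep[, 2]],
          r    = cc[keep])
      }
    }
  }
  do.call(rbind, out)
}
```

Each block pair costs one cor() call on a 144 × block_size slice, so memory stays bounded by block_size^2 values at a time; if even the filtered pair list is large, the inner data frames could be appended to a file instead of accumulated in a list.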

0 Answers