
I have a simple analysis to do: I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? Yet I have been unable to get results for a whole week, and I have looked through most of the solutions here.

My laptop has 4 GB of RAM, and I have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information, or perhaps part of the data, I can try to put it up, but the problem can be explained without seeing the data: I simply need a correlation matrix of size 411k x 411k, which means computing the correlation among all 411k variables (the rows, once the data are transposed).
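
Concretely, the naive one-liner fails on size alone: a 411k x 411k double-precision matrix needs about 411000^2 * 8 bytes ≈ 1.35 TB. (dat below is an assumed name for the data held as variables-by-observations.)

    # Naive approach: fine for small data, impossible here because the
    # 411k x 411k result alone would need ~1.35 TB of RAM.
    dat <- as.matrix(read.table("mydata.txt"))  # assumed 411k x 100 layout
    cm  <- cor(t(dat))                          # correlations among the rows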

Approaches I have tried to code (all of them either run into memory issues or run forever):

  1. The simplest way: correlate one row against all the others and write each result out with append = T. (Runs forever.)
  2. bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), which splits the data into blocks and uses an ff matrix. (Unable to allocate memory for the corMAT matrix using ff() on my server.)
  3. Splitting the data into sets (every 10,000 consecutive rows forming one set) and correlating each set against every other (same logic as bigcorPar), but I cannot find a way to store all the pieces together to assemble the final 411k x 411k matrix. (A minimal sketch of this block-by-block idea follows this list.)
  4. What I am attempting now: bigcorPar.r on 10,000 rows against all 411k (the 10,000 divided into blocks), saving the results in separate CSV files.
  5. I am also running every 1,000 rows vs. all 411k on one node of my server; today is my third day and it is still on row 71.
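
To make approaches 3 and 4 concrete, here is a minimal sketch of the block-by-block computation. It assumes dat is a 411k x 100 matrix with one variable per row; the block size and file names are illustrative:

    block_size <- 10000
    n <- nrow(dat)
    starts <- seq(1, n, by = block_size)
    for (i in seq_along(starts)) {
      for (j in i:length(starts)) {   # upper triangle only; the matrix is symmetric
        ri <- starts[i]:min(starts[i] + block_size - 1, n)
        rj <- starts[j]:min(starts[j] + block_size - 1, n)
        # cor() correlates columns, so transpose the row blocks
        cb <- cor(t(dat[ri, ]), t(dat[rj, ]))
        saveRDS(cb, sprintf("corblock_%03d_%03d.rds", i, j))
      }
    }

Saving blocks to disk sidesteps the RAM limit, but together they still occupy on the order of a terabyte even when only the upper triangle is kept.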

I am not an R pro, so this is as far as I could get. Either my code runs forever or I do not have enough memory to store the results. Are there more efficient ways to tackle this?

Thanks for all your comments and help.

  • A correlation matrix of size 411000 x 411000: that is 1.68921e+11 elements. The maximum number of elements in an ff vector is 2147483647, so your object would be roughly 80 times that maximum. You should rethink what you want to do with that correlation matrix. –  Jun 06 '14 at 08:18
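
The comment's arithmetic can be checked directly in R; .Machine$integer.max is the 2^31 - 1 limit it cites:

    411000^2                          # 1.68921e+11 elements in the full matrix
    .Machine$integer.max              # 2147483647, the ff vector length limit
    411000^2 / .Machine$integer.max   # ~78.7, i.e. roughly 80 times the limit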

1 Answer


I'm familiar with this problem myself in the context of genetic research.

If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
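
For illustration only (this is not code from the answer), a MatrixEQTL run that records just the significant pairwise correlations might look like the sketch below. It feeds the same matrix in as both data sets; the file name and threshold are placeholders:

    library(MatrixEQTL)

    dat <- as.matrix(read.table("mydata.txt"))  # assumed: 411k variables x 100 samples

    sd1 <- SlicedData$new(); sd1$CreateFromMatrix(dat)
    sd2 <- SlicedData$new(); sd2$CreateFromMatrix(dat)

    me <- Matrix_eQTL_engine(
      snps = sd1, gene = sd2,
      cvrt = SlicedData$new(),              # no covariates
      output_file_name = "significant_pairs.txt",
      pvOutputThreshold = 1e-8,             # keep only highly significant pairs
      useModel = modelLINEAR,
      verbose = TRUE,
      pvalue.hist = FALSE)

With modelLINEAR and no covariates, the test statistic is a monotone function of the Pearson correlation, so thresholding on the p-value is equivalent to thresholding on |r|.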

If you want to keep all correlations, I'd first like to warn you that even in binary format (economical compared to text) they would take 411,000 x 411,000 x 8 bytes ≈ 1.35 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
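
(Not the code the answer refers to, but as one illustration of how a matrix this size can live on disk: the bigmemory package can create a file-backed matrix that the blocks from approach 3 above could be written into. All names are illustrative.)

    library(bigmemory)

    # Creates a ~1.35 TB backing file; you need that much free disk space.
    cm <- filebacked.big.matrix(
      nrow = 411000, ncol = 411000, type = "double",
      backingfile = "cormat.bin",
      descriptorfile = "cormat.desc")

    # Each computed block can then be written in place, e.g.:
    # cm[ri, rj] <- cor(t(dat[ri, ]), t(dat[rj, ]))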

Andrey Shabalin
  • I do have a server with 32 nodes which I can use to run my correlation analysis, and I think I can free up 1.3 TB of space. But after many attempts I have not found a solution. I am going to look into Matrix_eQTL and will let you know if it helps in my case. Thanks a lot for letting me know about your package. – user2698508 Jun 10 '14 at 05:14
  • Matrix eQTL would only be helpful if you want to record the information about the few significant associations. For all correlations I was talking about different code. – Andrey Shabalin Jun 10 '14 at 05:17
  • Yeah, I got that. I am now looking into getting the most significant correlations and their corresponding p-values, as my many attempts at getting the full matrix failed miserably. – user2698508 Jun 10 '14 at 05:31
  • If I am unable to reach a conclusion using only the significant ones, and the full matrix proves necessary to understand the picture on a global scale, I will ask you for the other code. – user2698508 Jun 10 '14 at 05:35