
I am currently trying to calculate a covariance matrix for a ~30k row matrix (all values are in the range [0, 1]), and it is taking a very long time (I have let it run for over an hour and it still hasn't completed).

One thing I noticed on smaller examples (a 7k row matrix) is that the output values have a ridiculous number of significant digits (e.g. ~10^32), which may be slowing things down (and increasing the file size). Is there any way to limit this?

I've been using NumPy's covariance method on a simple dataframe:

import numpy as np
import pandas as pd
import sklearn as sk

df = pd.read_csv('gene_data/genetic_data25.csv')
df = df.set_index('ID_REF')

# Min-max scale each column into [0, 1]
df = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))

# np.cov treats each row as a variable, so this is a row-by-row covariance matrix
cov = np.cov(df)

cov = pd.DataFrame(cov)

cov.to_csv('/gemnetics/cov_matrix.csv')
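
On the significant-digits point, DataFrame.to_csv accepts a float_format argument, so the precision written to disk could be capped. A minimal sketch, with the six-decimal choice being an arbitrary illustration:

# Cap the precision written to the CSV; '%.6f' (six decimals) is an arbitrary choice
cov.to_csv('/gemnetics/cov_matrix.csv', float_format='%.6f')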
DrTchocky
  • How many columns does your matrix have? The output of `cov` should be normalized, so those big numbers do not make sense. You should also know that if you have a rectangular matrix with one dimension much larger than the other (rows, in your case), then computing the covariance matrix this way is inefficient because of null spaces. – anishtain4 Jul 26 '18 at 19:08
  • In this case we have something on the order of ~10^3 columns, so it is not a square matrix. I hadn't thought about the null spaces, thanks for bringing that up. – DrTchocky Jul 26 '18 at 19:19
  • May I ask what the next step is? There might be a way to handle the problem with right eigenvalues instead of forming the whole covariance matrix. – anishtain4 Jul 26 '18 at 19:23
  • I'm trying a multitude of ways to build directed acyclic graphs, using the covariance values as weights between nodes (i.e. rows in my matrix; the matrix is gene expression values in a series of experiments, and I want to see if the data can show which groups of genes interact with each other, and in what manner). I'm guessing eigenvalues could accomplish this as well. – DrTchocky Jul 26 '18 at 19:48
  • Seems like it should be pretty fast; 30k by 1k is not too large these days, although of course YMMV. Perhaps there is some unexpected overhead in the df class; maybe try reading the csv into R or Octave and computing the covariance there. Maybe try subsets of columns to see how the time required scales, and therefore estimate the time required for all 1k columns (see the timing sketch after these comments). In the bigger picture, I wonder if directed graphs are appropriate here, since covariance seems more naturally undirected. It sounds like you are looking for cliques in an undirected graph; maybe there is some work on that already. – Robert Dodier Jul 26 '18 at 22:09
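
A minimal sketch of the column-subset timing suggested in the last comment, assuming df is the normalized DataFrame from the question (the subset sizes are arbitrary):

import time
import numpy as np

X = df.to_numpy()  # assumed: the normalized DataFrame from the question
for ncols in (100, 250, 500):
    start = time.perf_counter()
    np.cov(X[:, :ncols])  # output is still rows-by-rows (~30k x 30k); only the number of observations shrinks
    print(ncols, 'columns:', round(time.perf_counter() - start, 2), 'seconds')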

1 Answer


Since I'm not familiar with genetics, I'll give you the general guidelines and hope it works. Let's assume you have your data in a matrix called X, which is 30+k by 1k. You don't really need to normalize your data (unless it matters to you for other reasons), but to calculate the covariance you do have to center it. Then you can calculate the right eigenvectors:

# Center the data (required for covariance)
Xp = X - X.mean(axis=0, keepdims=True)

# 1k-by-1k scatter (Gram) matrix, much smaller than the 30k-by-30k covariance
k = Xp.T @ Xp

# eigh returns eigenvalues in ascending order, so flip to descending
ev, R = np.linalg.eigh(k)
ev = ev[::-1]
R = R[:, ::-1]

At this point you should look at the eigenvalues to see if there is an abrupt drop in them (this is the scree method); let's call this cut-off number n. If there is no clear drop, then you just have to choose what fraction of the eigenvalues you want to keep. The next step is reconstructing the left eigenvectors:

# Project the data onto the top n right eigenvectors
L = X @ R[:, :n]

Now R.T tells you which combinations of eigenvectors are important, and the left eigenvectors (L) are the most prominent combinations of your genes. I hope this helps.
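
A minimal sketch of one way to pick the cut-off n from the eigenvalues above, using a retained-variance threshold (the 95% figure is an arbitrary illustrative choice, not something this answer prescribes):

import numpy as np

# ev is the descending eigenvalue array computed above
explained = ev / ev.sum()  # fraction of total variance per component
n = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1  # smallest n reaching 95%
print('keeping', n, 'components')

With n in hand, the L = X @ R[:, :n] line above gives the reduced representation.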

anishtain4