I'm currently trying to calculate a covariance matrix for a ~30k-row matrix (all values are in the range [0, 1]), and it's taking a very long time (I've let it run for over an hour and it still hasn't finished).
One thing I noticed on smaller examples (a 7k-row matrix) is that the output values have a ridiculous number of significant digits (e.g. ~10^32), which may be slowing things down (and increasing the file size). Is there any way to limit this?
I've been using NumPy's covariance function on a simple DataFrame:
import numpy as np
import pandas as pd

df = pd.read_csv('gene_data/genetic_data25.csv')
df = df.set_index('ID_REF')

# Min-max scale each column into [0, 1]
df = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))

# np.cov treats each row as a variable, so this produces a ~30k x ~30k matrix
cov = pd.DataFrame(np.cov(df))
cov.to_csv('/gemnetics/cov_matrix.csv')
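To illustrate what I mean by limiting the digits, here is a minimal sketch on a small random stand-in matrix (instead of my real data), using `DataFrame.round` and the `float_format` argument of `to_csv`:

```python
import numpy as np
import pandas as pd

# Small random stand-in for my data (values already in [0, 1])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((7, 5)))

# Same min-max scaling per column as in my script
df = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))

# np.cov treats each row as a variable -> 7x7 covariance matrix here
cov = pd.DataFrame(np.cov(df))

# Two ways to cap the digits that end up on disk:
cov_rounded = cov.round(6)                  # round the stored values
csv_text = cov.to_csv(float_format="%.6f")  # or only format on write

print(cov_rounded.shape)
```

Rounding like this shrinks the CSV, but I don't know whether the long decimals are actually what makes the computation itself slow.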