I am trying to factorize very large matrices with the Python library Nimfa. Since the matrix is so large, I am unable to instantiate it in a dense format in memory, so instead I use scipy.sparse.csr_matrix.
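For context, a matrix like this can be assembled directly in sparse form from coordinate triplets, never touching a dense array; the shape and values below are just placeholders, not my real data:

import numpy as np
from scipy.sparse import coo_matrix

# placeholder example: assemble from (row, col, value) triplets,
# then convert to CSR without ever allocating the dense array
rows = np.array([0, 3, 7, 999999])
cols = np.array([5, 2, 8, 42])
vals = np.array([1.0, 1.0, 2.0, 1.0])
big = coo_matrix((vals, (rows, cols)), shape=(1000000, 100)).tocsr()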
The library has a factorization method called Snmf: Sparse Nonnegative Matrix Factorization (SNMF), which appears to be what I am looking for.
When trying it out, I ran into serious performance issues with the factorization (not with the memory representation, but with speed): I have not yet been able to factor even a simple sparse 10 x 95 matrix.
This is how I build the test matrix:
import random
from scipy.sparse import lil_matrix, csc_matrix

m1 = lil_matrix((10, 95))  # LIL is cheap to fill entry by entry
for i in xrange(10):
    for j in xrange(95):
        if random.random() > 0.8: m1[i, j] = 1  # roughly 20% of entries become 1
m1 = csc_matrix(m1)  # convert to CSC for the factorization
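(As an aside, an equivalent test matrix can be generated in one call with scipy.sparse.rand, if your SciPy version has it; density=0.2 approximates the > 0.8 threshold above, and the data array is overwritten with ones to match:)

from scipy.sparse import rand

# roughly equivalent test matrix: ~20% of entries nonzero, CSC format
m1_alt = rand(10, 95, density=0.2, format='csc')
m1_alt.data[:] = 1  # replace the random values with ones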
and this is how I run it:
import numpy
import nimfa
from time import time

t = time()
fctr = nimfa.mf(m1,
                seed="random_vcol",
                rank=2,
                method="snmf",
                max_iter=15,
                initialize_only=True,
                version='r',  # SNMF/R variant
                eta=1.,
                beta=1e-4,
                i_conv=10,
                w_min_change=0)
print numpy.shape(m1)
a = nimfa.mf_run(fctr)  # this is the call that never seems to return
print a.coef()
print a.basis()
print time() - t
This doesn't seem to finish at all, but if I pass m1.todense() instead, it finishes in seconds. Since I am unable to instantiate my real matrix densely, that is not a workable solution for me.
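To be explicit about the dense comparison, this is roughly what I mean by passing m1.todense() (same parameters, only the target changed; sketched here rather than copied verbatim from my script):

fctr_dense = nimfa.mf(m1.todense(),  # numpy.matrix target instead of sparse
                      seed="random_vcol",
                      rank=2,
                      method="snmf",
                      max_iter=15,
                      initialize_only=True,
                      version='r',
                      eta=1.,
                      beta=1e-4,
                      i_conv=10,
                      w_min_change=0)
res = nimfa.mf_run(fctr_dense)  # this completes in seconds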
I have tried different scipy.sparse matrix formats, but to no avail: csc_matrix, csr_matrix and dok_matrix.
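(Each format was produced by converting the same matrix, along these lines:)

m_csc = m1.tocsc()  # compressed sparse column
m_csr = m1.tocsr()  # compressed sparse row
m_dok = m1.todok()  # dictionary of keys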
Am I using the wrong matrix format? Which matrix operations does the SNMF algorithm need to be fast? Is there some other mistake I am overlooking?