I have a sparse binary matrix represented with a file like this:
p_1|m_11
p_1|m_12
p_1|m_13
...
p_1|m_1,N1
p_2|m_21
p_2|m_22
...
p_2|m_2,N2
...
p_K|m_K1
...
p_K|m_K,NK
The p's and m's come from two respective sets. If there are K unique p's and L unique m's, the above represents a sparse K x L matrix, with each line of the file corresponding to a single 1 element of the matrix.
The p's are integers; the m's are alphanumeric strings.
I need fast access to individual elements of the matrix as well as to its rows and columns. The current implementation shown below worked fine for small values of K (L is always about 50,000) but does not scale.
from scipy import sparse
from numpy import array
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

Ps = sorted(Ps)  # optional but prefer sorted
Ms = sorted(Ms)  # optional but prefer sorted
K = len(Ps)
L = len(Ms)

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=int)  # row index of each nonzero element
J = np.zeros(nnz, dtype=int)  # column index of each nonzero element
V = np.ones(nnz)
ix = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        I[ix] = Ps.index(parts[0])  # TAKES TOO LONG FOR LARGE K
        J[ix] = Ms.index(parts[1])
        ix += 1

data = sparse.coo_matrix((V, (I, J)), shape=(K, L)).tocsr()
There has got to be a different way of doing this that scales better, but what is it?
I am not married to the sparse matrix format (a dict?); I am willing to use any data structure that gives me fast access to individual elements, "rows" and "columns".
CLARIFICATION (I hope):
I am trying to move away from retrieving elements, rows and columns of my data using integer row/column values that get extracted by searching through two long arrays of strings.
Instead I just want to use the actual p's and m's as keys, so instead of data[i,j] I want to use something like data[p_10,m_15]; and instead of data[i,:] use something like data[p_10,:].
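As a rough illustration of the interface described above, here is a minimal sketch built on two plain dicts of sets rather than a scipy matrix. The class name `LabelledBinaryMatrix`, its `from_lines` constructor and the sample data are all hypothetical; it is only meant to show what label-keyed access could look like, not to be a tuned implementation.

```python
from collections import defaultdict


class LabelledBinaryMatrix:
    """Sparse 0/1 matrix addressed by p and m labels (hypothetical helper)."""

    def __init__(self):
        self.rows = defaultdict(set)  # p -> set of m's that are 1 in that row
        self.cols = defaultdict(set)  # m -> set of p's that are 1 in that column

    @classmethod
    def from_lines(cls, lines):
        """Build the structure in a single pass over 'p|m' lines."""
        mx = cls()
        for line in lines:
            line = line.strip()
            if not line:
                continue
            p, m = line.split('|')
            mx.rows[p].add(m)
            mx.cols[m].add(p)
        return mx

    def __getitem__(self, key):
        p, m = key
        p_all = isinstance(p, slice)
        m_all = isinstance(m, slice)
        if p_all and m_all:
            raise KeyError("select a row, a column, or a single element")
        if m_all:   # data[p, :] -> labels of the 1s in row p
            return self.rows.get(p, set())
        if p_all:   # data[:, m] -> labels of the 1s in column m
            return self.cols.get(m, set())
        # data[p, m] -> 1 if the pair is present, else 0
        return 1 if m in self.rows.get(p, ()) else 0


# Made-up sample lines standing in for the data file.
sample = ["10|m_15", "10|m_16", "11|m_15"]
data = LabelledBinaryMatrix.from_lines(sample)

print(data["10", "m_15"])        # single element
print(sorted(data["10", :]))     # one "row"
print(sorted(data[:, "m_15"]))   # one "column"
```

In this sketch the labels are kept as strings exactly as they appear in the file, and any slice is treated as "everything" on that axis. An open file object can be passed to `from_lines` in place of the sample list, so the structure is filled in one pass without any list searches.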
I also need to be able to create data fast from my data file.
Again, data does not need to be a scipy or numpy sparse matrix.
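For comparison, if the integer-indexed matrix were kept after all, the slow part of the current code is the repeated list search, and one possible change is to look labels up in dicts instead. The sketch below assumes all pairs fit in memory; the function name `index_pairs` and the sample lines are made up for illustration.

```python
def index_pairs(lines):
    """Turn 'p|m' lines into integer (row, col) index lists via dict lookups."""
    pairs = []
    for line in lines:
        line = line.strip()
        if line:
            p, m = line.split('|')
            pairs.append((p, m))

    # label -> position maps; each lookup is a hash lookup, not a list scan
    p_index = {p: i for i, p in enumerate(sorted({p for p, _ in pairs}))}
    m_index = {m: j for j, m in enumerate(sorted({m for _, m in pairs}))}

    I = [p_index[p] for p, _ in pairs]  # row index of each nonzero element
    J = [m_index[m] for _, m in pairs]  # column index of each nonzero element
    return I, J, p_index, m_index


# Made-up sample lines standing in for the data file.
sample = ["2|b", "1|a", "1|b", "3|c"]
I, J, p_index, m_index = index_pairs(sample)
print(I, J)
```

The resulting `I` and `J` lists could then be handed to `coo_matrix` in the same way as in the code above, and `p_index` / `m_index` kept around to translate a label into its row or column number.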