
I have a sparse binary matrix represented with a file like this:

p_1|m_11
p_1|m_12
p_1|m_13
...
p_1|m_1,N1
p_2|m_21
p_2|m_22
...
p_2|m_2,N2
...
p_K|m_K1
...
p_K|m_K,NK

p's and m's come from two respective sets. If there are K unique p's and L unique m's, the above represents a sparse K × L matrix, with each line of the file corresponding to a single 1 element of the matrix.

p's are integers; m's are alphanumeric strings.
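
For example (hypothetical values, just to illustrate the format), the lines

3|abc
3|xyz
7|abc

describe a 2 × 2 matrix over p's {3, 7} and m's {abc, xyz}, with 1's at (3, abc), (3, xyz), and (7, abc), and a 0 at (7, xyz).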

I need to have fast access to both individual elements of the matrix and its rows and columns. The current implementation shown below worked fine for small values of K (L is always about 50,000) but does not scale.

from scipy import sparse
from numpy import array
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

Ps = sorted(Ps)    # optional, but I prefer sorted order
Ms = sorted(Ms)    # optional, but I prefer sorted order
K = len(Ps)
L = len(Ms)

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=np.intp)   # row indices
J = np.zeros(nnz, dtype=np.intp)   # column indices
V = np.ones(nnz)                   # all stored values are 1

ix = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        I[ix] = Ps.index(parts[0])  # TAKES TOO LONG FOR LARGE K
        J[ix] = Ms.index(parts[1])
        ix += 1

data = sparse.coo_matrix((V, (I, J)), shape=(K, L)).tocsr()

There has got to be a different way of doing this that scales better, but what is it?

I am not married to the sparse matrix format (a dict, maybe?); I am willing to use any data structure that gives me fast access to individual elements, "rows", and "columns".


CLARIFICATION (I hope):
I am trying to move away from retrieving elements, rows, and columns of my data via integer row/column indices that first have to be found by searching through two long lists of strings.

Instead, I just want to use the actual p's and m's as keys, so instead of data[i,j] I want to use something like data[p_10,m_15]; and instead of data[i,:], something like data[p_10,:].

I also need to be able to create data fast from my data file.

Again, data does not need to be a scipy or numpy sparse matrix.
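
For anyone wondering what such a key-based structure could look like: below is a minimal sketch using plain dicts of sets keyed directly by p and m (the names rows/cols and the keys 'p_10'/'m_15' are only illustrative). It gives O(1) element tests and direct row/column retrieval, at the cost of not being a numeric matrix:

from collections import defaultdict

rows = defaultdict(set)   # rows[p] = set of m's with a 1 in row p
cols = defaultdict(set)   # cols[m] = set of p's with a 1 in column m

with open('data.txt', 'r') as fin:
    for line in fin:
        p, m = line.strip().split('|')
        rows[p].add(m)
        cols[m].add(p)

is_one  = 'm_15' in rows['p_10']   # single element as a membership test
row_p10 = rows['p_10']             # "row" p_10: all m's set in that row
col_m15 = cols['m_15']             # "column" m_15: all p's set in that column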

  • What do you mean by *'takes too long'*? How long is too long? – Peter Wood Nov 07 '15 at 15:34
  • I think numpy has a sparse matrix class. – Jonathon Reinhart Nov 07 '15 at 15:35
  • @PeterWood "too long" is long enough to make it impractical for large `K`. – I Z Nov 07 '15 at 15:44
  • @JR I am looking for an alternative data structure that would allow me to access rows and columns fast while also not having to use `list.index()` to populate the matrix – I Z Nov 07 '15 at 15:46
  • @IZ May you kindly specify your order of preferences altogether with a quantitative scale for acceptable ranges for the solutions you seek? As an example: **For K x L ~ 10.000.000 acceptable assembly time is about NNN [ms], with a need to scale-out to handle K x L sizes not above ~10.000.000.000 still under MMM [ms]"** ? – user3666197 Nov 07 '15 at 15:56
  • @IZ the same applies to a **quantitative specification of acceptable speed in [msec]** for the stated "***fast* access to { element | row | column }-s**". While it might not seem important to you, defining these metrics is vital for choosing an adequate approach - not only for the initial data-representation assembly, but for the whole life-cycle of the data-processing. One example: a python **dict** (like any other representation) carries a certain overhead, normally not visible, which may make such an approach unusable beyond a certain scale. Numba acceleration of dicts is not supported at all. – user3666197 Nov 07 '15 at 16:06
  • For a brief quantitative view into **data-representation overheads** - both at instantiation and during processing - at growing data-structure sizes and data-element sizes, see http://stackoverflow.com/questions/33582663/python-lists-vs-arrays-speed – user3666197 Nov 07 '15 at 19:30

1 Answer


I was able to speed up the 2nd pass below simply by creating two inverse indices beforehand:

from scipy import sparse
from numpy import array
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

Ps = sorted(Ps)    # optional, but I prefer sorted order
Ms = sorted(Ms)    # optional, but I prefer sorted order
K = len(Ps)
L = len(Ms)

# create inverse indices (value -> integer position) for O(1) lookup
#
mapPs = {p: i for i, p in enumerate(Ps)}
mapMs = {m: i for i, m in enumerate(Ms)}

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=np.intp)   # row indices
J = np.zeros(nnz, dtype=np.intp)   # column indices
V = np.ones(nnz)                   # all stored values are 1

ix = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        #I[ix] = Ps.index(parts[0]) # TAKES TOO LONG FOR LARGE K
        #J[ix] = Ms.index(parts[1]) # TAKES TOO LONG FOR LARGE K
        I[ix] = mapPs[parts[0]]
        J[ix] = mapMs[parts[1]]
        ix += 1

data = sparse.coo_matrix((V, (I, J)), shape=(K, L)).tocsr()

I have not had a chance to test it on a much larger dataset yet, but on a smaller one that I was having problems with, execution time went from about 1 hour to about 10 seconds! So I am satisfied with this solution for now.
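
For element/row/column access afterwards, the inverse indices can simply be kept around and used to translate p's and m's into row/column numbers. A sketch (using the illustrative keys 'p_10' and 'm_15' from the question, and a CSC copy for fast column slicing):

i = mapPs['p_10']          # row index of p_10
j = mapMs['m_15']          # column index of m_15

element = data[i, j]       # single element
row = data[i, :]           # whole row, fast since data is CSR
data_csc = data.tocsc()    # keep a CSC copy for fast column access
col = data_csc[:, j]       # whole column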
