7

I need a fast element-wise maximum that compares each row of an n-by-m scipy sparse matrix element-wise to a sparse 1-by-m matrix. This works perfectly in Numpy using np.maximum(mat, vec) via Numpy's broadcasting.

However, Scipy's .maximum() does not have broadcasting. My matrix is large, so I cannot cast it to a numpy array.

My current workaround is to loop over the many rows of mat with mat[row,:].maximum(vec). This big loop is ruining my code efficiency (it has to be done many times). My slow solution is in the second code snippet below -- Is there a better solution?

# Example
import numpy as np
from scipy import sparse

mat = sparse.csc_matrix(np.arange(12).reshape((4,3)))

vec = sparse.csc_matrix([-1, 5, 100])

# Numpy's np.maximum() gives the **desired result** using broadcasting (but it can't handle sparse matrices):
numpy_result = np.maximum( mat.toarray(), vec.toarray() )
print( numpy_result )
# [[  0   5 100]
#  [  3   5 100]
#  [  6   7 100]
#  [  9  10 100]]

# Scipy only compares the top row of mat to vec (no broadcasting!):
scipy_result = mat.maximum(vec)
print( scipy_result.toarray() )
# [[  0   5 100]
#  [  3   4   5]
#  [  6   7   8]
#  [  9  10  11]]

#Reversing the order of mat and vec in the call to vec.maximum(mat) results in a single row output, and also frequently seg faults (!):

Larger example & current solution for speed testing

import numpy as np
from scipy import sparse
import timeit

mat = sparse.csc_matrix(  sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s))  )

vec = sparse.csc_matrix(  sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s))  )

def sparse_elementwise_maximum(mat, vec):
    output = sparse.lil_matrix(mat.shape)
    for row_idx in range( mat.shape[0] ):
        output[row_idx] = mat[row_idx,:].maximum(vec)
    return output

# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
for _ in range(int(num_timing_loops)):
    sparse_elementwise_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# 15 seconds per call (way too slow!)

EDIT I'm accepting Max's answer, as the question was specifically about a high performance solution, and Max's solution offers huge 1000x-2500x speedups on various inputs I tried at the expense of more lines of code and Numba compiling. However, for general use, Daniel F's one-liner is a great solution offers 10x-50x speedups on examples I tried--I will probably use for many other things.

cataclysmic
  • 337
  • 2
  • 12
  • OK, weird, doing those with `csr_matrix` or `coo_matrix` crashes my kernel. `csc_matrix` works though. Can anyone replicate? – Daniel F Nov 17 '20 at 09:46
  • Yes -- Just edited to make them csc_matrix so the example can work for others. I have also had some crashes using csr_matrix with the .maximum() function (seg faults specifically). – cataclysmic Nov 17 '20 at 09:48
  • OK, might need some of the backend wizards to figure out what's happening in the compiled `scipy` code then. Have some patience, it might take a while for one of them to see this. – Daniel F Nov 17 '20 at 09:51
  • So, as far as I can tell every form of sparse matrix from `scipy` other than `csc_matrix` reverts to `csr_matrix` for doing `maximum`, so at least it's only one root cause. I can't follow the `cs_matrix`'s [`_maxumum_minimum_`](https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/compressed.py#L547-L568) function though, or how `maximum` calls it. It seems like that `if dense_check(other):` call should raise a `ValueError` as it's an array. Maybe something's geting short-circuited there – Daniel F Nov 17 '20 at 10:50
  • `scipy/sparse/compressed.py` contains the code for methods like `maximum`, with `_binopt` at the core. The result is a sparse array with `self.shape`, built from `data`,`indptr` and `indices` arrays. The logic isn't simple. It makes most sense if the `other` is scalar, or matches `self` in shape. – hpaulj Nov 17 '20 at 16:38
  • With (20000, 4000) shape the `indptr` array for `csr` will be 20000 long, and 4000 for `csc`. For (1,4000) the `csr` will be more compact. Sometimes we get speed by iterating on the `indptr`, fetching the row (or column) data and indices directly. I've done that with one matrix, but haven't tried it with 2. – hpaulj Nov 17 '20 at 16:48
  • Is the transformation into a csr-matrix possible, or is the matrix in the real world example so huge that this causes problems? Is the sparisity (in your example 0.01) quite the same in your real world application? – max9111 Nov 18 '20 at 16:21
  • @max9111 yes -- I had them all as csr-matrices originally. I edited the code when Daniel F pointed out that he was getting errors (seg faults?), and I had them sporadically too -- for some reason changing to csc-matrices fixes this. But yes -- csr is totally fine. – cataclysmic Nov 18 '20 at 23:02
  • 1
    @max9111 Re: sparsity, it varies but reasonable values are 0.001 to 0.005 fraction of nonzero elements. – cataclysmic Nov 18 '20 at 23:09

3 Answers3

6

A low level approach

As always you can think on how a proper sparse matrix format for this operation is built up, for csr-matrices the main components are shape, data_arr,indices and ind_ptr. With these parts of the scipy.sparse.csr object it is quite straight forward but maybe a bit time consuming to implement an efficient algorithm in a compiled language (C,C++,Cython, Python-Numba). Int his implemenation I used Numba, but porting it to C++ should be easily possible (syntax changes) and maybe avoiding the slicing.

Implementation (first try)

import numpy as np
import numba as nb

# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec):
    mat_csr=mat.tocsr()
    vec_csr=vec.tocsr()

    shape_mat=mat_csr.shape
    indices_mat=mat_csr.indices
    indptr_mat=mat_csr.indptr
    data_mat=mat_csr.data
    indices_vec=vec_csr.indices
    data_vec=vec_csr.data

    res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec)
    res=sparse.csr_matrix(res, shape=shape_mat)
    return res

@nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
    data_res=[]
    indices_res=[]
    indptr_mat_res=[]

    indptr_mat_=0
    indptr_mat_res.append(indptr_mat_)

    for row_idx in range(shape_mat[0]):
        mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
        mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]

        mat_ptr=0
        vec_ptr=0
        while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
            ind_mat=mat_row_ind[mat_ptr]
            ind_vec=vec_row_ind[vec_ptr]

            #value for both matrix and vector is present
            if ind_mat==ind_vec:
                data_res.append(max(mat_row_data[mat_ptr],vec_row_data[vec_ptr]))
                indices_res.append(ind_mat)
                mat_ptr+=1
                vec_ptr+=1
                indptr_mat_+=1

            #only value for the matrix is present vector is assumed 0
            elif ind_mat<ind_vec:
                if mat_row_data[mat_ptr] >0:
                    data_res.append(mat_row_data[mat_ptr])
                    indices_res.append(ind_mat)
                    indptr_mat_+=1
                mat_ptr+=1

            #only value for the vector is present matrix is assumed 0
            else:
                if vec_row_data[vec_ptr] >0:
                    data_res.append(vec_row_data[vec_ptr])
                    indices_res.append(ind_vec)
                    indptr_mat_+=1
                vec_ptr+=1

        for i in range(mat_ptr,mat_row_ind.shape[0]):
            if mat_row_data[i] >0:
                data_res.append(mat_row_data[i])
                indices_res.append(mat_row_ind[i])
                indptr_mat_+=1
        for i in range(vec_ptr,vec_row_ind.shape[0]):
            if vec_row_data[i] >0:
                data_res.append(vec_row_data[i])
                indices_res.append(vec_row_ind[i])
                indptr_mat_+=1
        indptr_mat_res.append(indptr_mat_)

    return np.array(data_res),np.array(indices_res),np.array(indptr_mat_res)

Implementation (optimized)

In this approach the lists are replaced by a dynamically resized array. I increased the size of the output in 60 MB steps. On creation of the csr-object, there is also no copy of the data made, just references. If you want avoid a memory overhead you have to copy the arrays in the end.

@nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data):
    mem_step=5_000_000
    #preallocate memory for 5M non-zero elements (60 MB in this example)
    data_res=np.empty(mem_step,dtype=data_mat.dtype)
    indices_res=np.empty(mem_step,dtype=np.int32)
    data_res_p=0

    indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
    indptr_mat_res[0]=0
    indptr_mat_res_p=1
    indptr_mat_=0

    for row_idx in range(shape_mat[0]):
        mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
        mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]

        #check if resizing is necessary
        if data_res.shape[0]<data_res_p+shape_mat[1]:
            #add at least memory for another mem_step elements
            size_to_add=mem_step
            if shape_mat[1] >size_to_add:
                size_to_add=shape_mat[1]

            data_res_2   =np.empty(data_res.shape[0]   +size_to_add,data_res.dtype)
            indices_res_2=np.empty(indices_res.shape[0]+size_to_add,indices_res.dtype)
            for i in range(data_res_p):
                data_res_2[i]=data_res[i]
                indices_res_2[i]=indices_res[i]
            data_res=data_res_2
            indices_res=indices_res_2

        mat_ptr=0
        vec_ptr=0
        while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
            ind_mat=mat_row_ind[mat_ptr]
            ind_vec=vec_row_ind[vec_ptr]

            #value for both matrix and vector is present
            if ind_mat==ind_vec:
                data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
                indices_res[data_res_p]=ind_mat
                data_res_p+=1
                mat_ptr+=1
                vec_ptr+=1
                indptr_mat_+=1

            #only value for the matrix is present vector is assumed 0
            elif ind_mat<ind_vec:
                if mat_row_data[mat_ptr] >0:
                    data_res[data_res_p]=mat_row_data[mat_ptr]
                    indices_res[data_res_p]=ind_mat
                    data_res_p+=1
                    indptr_mat_+=1
                mat_ptr+=1

            #only value for the vector is present matrix is assumed 0
            else:
                if vec_row_data[vec_ptr] >0:
                    data_res[data_res_p]=vec_row_data[vec_ptr]
                    indices_res[data_res_p]=ind_vec
                    data_res_p+=1
                    indptr_mat_+=1
                vec_ptr+=1

        for i in range(mat_ptr,mat_row_ind.shape[0]):
            if mat_row_data[i] >0:
                data_res[data_res_p]=mat_row_data[i]
                indices_res[data_res_p]=mat_row_ind[i]
                data_res_p+=1
                indptr_mat_+=1
        for i in range(vec_ptr,vec_row_ind.shape[0]):
            if vec_row_data[i] >0:
                data_res[data_res_p]=vec_row_data[i]
                indices_res[data_res_p]=vec_row_ind[i]
                data_res_p+=1
                indptr_mat_+=1
        indptr_mat_res[indptr_mat_res_p]=indptr_mat_
        indptr_mat_res_p+=1

    return data_res[:data_res_p],indices_res[:data_res_p],indptr_mat_res

Maximum memory allocated in the beginning

The performance and usability of this approach heavily depends on the inputs. In this approach the maximal memory is allocated (this could easily cause out of memory errors).

@nb.njit(cache=True)
def sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,vec_row_ind,vec_row_data,shrink_to_fit):
    max_non_zero=shape_mat[0]*vec_row_data.shape[0]+data_mat.shape[0]
    data_res=np.empty(max_non_zero,dtype=data_mat.dtype)
    indices_res=np.empty(max_non_zero,dtype=np.int32)
    data_res_p=0

    indptr_mat_res=np.empty((shape_mat[0]+1),dtype=np.int32)
    indptr_mat_res[0]=0
    indptr_mat_res_p=1
    indptr_mat_=0

    for row_idx in range(shape_mat[0]):
        mat_row_ind=indices_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]
        mat_row_data=data_mat[indptr_mat[row_idx]:indptr_mat[row_idx+1]]

        mat_ptr=0
        vec_ptr=0
        while mat_ptr<mat_row_ind.shape[0] and vec_ptr<vec_row_ind.shape[0]:
            ind_mat=mat_row_ind[mat_ptr]
            ind_vec=vec_row_ind[vec_ptr]

            #value for both matrix and vector is present
            if ind_mat==ind_vec:
                data_res[data_res_p]=max(mat_row_data[mat_ptr],vec_row_data[vec_ptr])
                indices_res[data_res_p]=ind_mat
                data_res_p+=1
                mat_ptr+=1
                vec_ptr+=1
                indptr_mat_+=1

            #only value for the matrix is present vector is assumed 0
            elif ind_mat<ind_vec:
                if mat_row_data[mat_ptr] >0:
                    data_res[data_res_p]=mat_row_data[mat_ptr]
                    indices_res[data_res_p]=ind_mat
                    data_res_p+=1
                    indptr_mat_+=1
                mat_ptr+=1

            #only value for the vector is present matrix is assumed 0
            else:
                if vec_row_data[vec_ptr] >0:
                    data_res[data_res_p]=vec_row_data[vec_ptr]
                    indices_res[data_res_p]=ind_vec
                    data_res_p+=1
                    indptr_mat_+=1
                vec_ptr+=1

        for i in range(mat_ptr,mat_row_ind.shape[0]):
            if mat_row_data[i] >0:
                data_res[data_res_p]=mat_row_data[i]
                indices_res[data_res_p]=mat_row_ind[i]
                data_res_p+=1
                indptr_mat_+=1
        for i in range(vec_ptr,vec_row_ind.shape[0]):
            if vec_row_data[i] >0:
                data_res[data_res_p]=vec_row_data[i]
                indices_res[data_res_p]=vec_row_ind[i]
                data_res_p+=1
                indptr_mat_+=1
        indptr_mat_res[indptr_mat_res_p]=indptr_mat_
        indptr_mat_res_p+=1

    if shrink_to_fit==True:
        data_res=np.copy(data_res[:data_res_p])
        indices_res=np.copy(indices_res[:data_res_p])
    else:
        data_res=data_res[:data_res_p]
        indices_res=indices_res[:data_res_p]

    return data_res,indices_res,indptr_mat_res

# get all needed components of the csr object and create a resulting csr object at the end
def sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True):
    mat_csr=mat.tocsr()
    vec_csr=vec.tocsr()

    shape_mat=mat_csr.shape
    indices_mat=mat_csr.indices
    indptr_mat=mat_csr.indptr
    data_mat=mat_csr.data
    indices_vec=vec_csr.indices
    data_vec=vec_csr.data

    res=sparse_elementwise_maximum_nb(indices_mat,indptr_mat,data_mat,shape_mat,indices_vec,data_vec,shrink_to_fit)
    res=sparse.csr_matrix(res, shape=shape_mat)
    return res

Timings

Numba has a compilation overhead or some overhead to load the function from cache. Don't consider the first call if you want to get the runtime and not compilation+runtime.

import numpy as np
from scipy import sparse

mat = sparse.csr_matrix(  sparse.random(20000, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s))  )
vec = sparse.csr_matrix(  sparse.random(1, 4000, density=.01, data_rvs=lambda s: np.random.randint(0, 5000, size=s))  )

%timeit output=sparse_elementwise_maximum(mat, vec)
#for csc input
37.9 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#for csr input
10.7 s ± 90.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Daniel F
%timeit sparse_maximum(mat, vec)
164 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#low level implementation (first try)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
89.7 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#low level implementation (optimized, csr)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#low level implementation (preallocation, without copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#low level implementation (preallocation, with copying at the end)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec)
16.5 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=False)
14.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res=sparse_elementwise_maximum_wrap(mat,vec,shrink_to_fit=True)
21.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#For comparison, copying the result takes
%%timeit
np.copy(res.data)
np.copy(res.indices)
np.copy(res.indptr)
7.8 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
max9111
  • 6,272
  • 1
  • 16
  • 33
  • This is awesome. On bigger inputs I am getting speedups of ~2000x. I tried to write this in Numba yesterday for the first time (have never used Numba before) and ended up with something much slower than the original -- clearly it was a matter of skill. Question: based on my very limited C++ knowledge, could I go even faster by multi-threading over rows of mat, as each row is independent? – cataclysmic Nov 20 '20 at 09:25
  • @cataclysmic Yes, this should be possible, but I think sophisticated C++ knowledge is necessary, to get it really efficient code. The main problem of the Numba implementation above is not the lack of parallelization, but collecting the results in Numba-Lists. Maybe I find some time in the evening to optimize this away (should give another factor of two or more) – max9111 Nov 20 '20 at 13:49
  • Amazing, but I think I must be doing something wrong---I can't replicate your timings. After compiling both the original and optimized versions, I get 0.0643sec for the original and 0.0488sec for the optimized, which is better but not the 5x speedup you're getting for the optimized vs. original. I see your code is calling sparse_elementwise_maximum_wrap(), but I'm seeing sparse_elementwise_maximum_**2**_wrap from the original version above -- did this function maybe change? Thanks again btw for these amazing implementations. – cataclysmic Nov 21 '20 at 07:36
  • Also I should have prob added to the original code a call to np.random.seed(12345) right before generating mat and vec so they're always the same for timing consistency. – cataclysmic Nov 21 '20 at 07:38
  • @cataclysmic The memstep was too small. This value should generally be adapded for a real world problem. The wrapper stays the same for both funcions. – max9111 Nov 21 '20 at 10:18
  • Ahh I see. I'm now able to get 3x+ speedup over the original, which was already blazing fast. If I understand, the block that begins *#check if resiting is necessary* just checks we don't run out of preallocated space. So I can understand, why not use e.g. the sum of the lengths of the inputs' sizes of mat + those for vec instead of memstep (worst case is the number of nonzero elements in the output is the sum of nonzero in mat + nonzero in vec)? Maybe this is too wasteful and needs extra len() calls? Thanks again -- I am learning a lot by looking at your code. – cataclysmic Nov 21 '20 at 11:24
  • @cataclysmic The worst case is nonzero elements of mat + nonzero elements of vec*mat.shape[0]. Depending on the sparisity structure this can be much larger than the actual result. You can try that, but I would recommend to check this for your real world example. In this case I would also definitly recommend a copy of data_res and indices_res at the end, to avoid persisting overallocation of large amounts of RAM. – max9111 Nov 21 '20 at 17:59
  • @cataclysmic I have added a version which allocates the maximum needed memory. – max9111 Nov 23 '20 at 19:19
  • Awesome. I'm learning a lot about Numba from reading your solutions -- totally different than my go-to numpy-esque approach to most problems. Thanks! – cataclysmic Nov 24 '20 at 09:09
  • Holy cow. Can't even be mad losing a checkmark to that answer. That's some inspired work. – Daniel F Dec 01 '20 at 08:32
  • @DanielF Thank you for the bounty. Maybe I will try a C version too, if I find some time. – max9111 Dec 03 '20 at 08:46
  • @max9111 I would *love* to see a C version. After seeing your Numba version's performance, it's clear I'll probably try writing the whole algorithm (where your answer here is the most challenging step step) in C at some point. – cataclysmic Dec 11 '20 at 05:57
3

scipy.sparse matrices don't broadcast. At all. So unless you can figure out some way to operate on the indices and inpts (I haven't), you're stuck stacking. Best I can figure out is just to vstack your vecs until they're the same shape as mat. It seems to give a good speedup, although it doesn't explain the segfault weirdness with csr.

#using `mat` and `vec` from the speed test
def sparse_maximum(mat, vec):
    vec1 = sparse.vstack([vec for _ in range(mat.shape[0])])
    return mat.maximum(vec1)

# Time it
num_timing_loops = 3.0
starttime = timeit.default_timer()
sparse_maximum(mat, vec)
print('time per call is:', (timeit.default_timer() - starttime)/num_timing_loops, 'seconds')
# I was getting 11-12 seconds on your original code
time per call is: 0.514533479333295 seconds

Proof that it works on original matrices:

vec = sparse.vstack([vec for _ in range(4)])

print(mat.maximum(vec).todense())
[[  0   5 100]
 [  3   5 100]
 [  6   7 100]
 [  9  10 100]]
Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • This is giving my a 15x speedup on the example in my question, and more like 50x speedup on a bigger (50k X 10k) matrix. Thank you for posting! I agree that constructing a mat-sized vstack of vecs feels `wrong' to me (efficiency-wise), but perhaps there is no way around this. – cataclysmic Nov 18 '20 at 00:32
1

Looking at the maximum method, and code, especially the _binopt method in /scipy/sparse/compressed.py it's apparent that it can work with a scalar other. For a sparse other it constructs a new sparse matrix (of the same format and shape) using indptr, etc values. If other has the same shape, it works:

In [55]: mat = sparse.csr_matrix(np.arange(12).reshape((4,3)))
In [64]: mat.maximum(mat)
Out[64]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 11 stored elements in Compressed Sparse Row format>

It fails is the other is a 1d sparse matrix:

In [65]: mat.maximum(mat[0,:])
Segmentation fault (core dumped)

mat.maximum(mat[:,0]) runs without error, though I'm not sure about the values. mat[:,0] will have the same size indptr.

I thought the mat.maximum(mat[:,0]) would give same fault if mat was csc, but it doesn't.

Let's be honest, this kind of operation is not a strong point for sparse matrices. The core of its math is matrix multiplication. That's what they were originally developed for - sparse linear algebra problems such as finite difference and finite element.

hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Thanks. I am using scipy.sparse matrices for matrix multiplication. This part of the code is obviously very fast (much faster than numpy arrays). The matrix that is the element-wise maximum of (mat, vec) is needed for one of the dot products, though. – cataclysmic Nov 18 '20 at 00:34
  • Do the nonzeros of `vec` add substantially to the nonzeros of the results? If `vec` has zeros in the same place as `mat` that `maximum` is potentially a lot simpler than if can turns 0s in `mat` to nonzero. Changes to the sparsity of a `csr/csc` is expensive, relativley speaking. – hpaulj Nov 18 '20 at 01:34
  • I just had the same idea! Yes, the zeros in vec are also zeros in mat (i.e. 0-columns), and I already have the indices of these zeros from a different operation. I wrote the numpy version, which is way faster than my previous numpy version. Still not sure how to write the optimal scipy version (I'm relatively new to scipy, so still figuring out how to do things efficiently) – cataclysmic Nov 18 '20 at 04:31
  • The numpy way is: `np.maximum( mat[:,nonzero_cols_booleanvec] , vec[nonzero_cols_boolean_vec] )' Right now I'm trying a version that does the slicing in scipy, then casts to numpy arrays (which makes mat and vec much narrower=less memory than dense versions of full mat & full vec because they're sparse and we only keep nonzeros after slicing), then uses np.maximum. There's probably something much better without this ugly casting, but I don't know it. – cataclysmic Nov 18 '20 at 04:34