
Is it possible to effectively obtain the norm of a sparse vector in python?

I tried the following:

from scipy import sparse
from numpy import float64
from numpy.linalg import norm

vector1 = sparse.csr_matrix([ 0 for i in xrange(4000000) ], dtype = float64)

#just to test I set a few points to a value higher than 0

vector1[ (0, 10) ] = 5
vector1[ (0, 1500) ] = 80
vector1[ (0, 2000000) ] = 6

n = norm(vector1)

but then I get the error:

ValueError: dimension mismatch

The norm function only works with arrays, so that's probably why it fails on a csr_matrix, but I didn't find another way of computing the norm efficiently. One possible solution would be to compute:

norm(asarray(vector1.todense()))

but that defeats the purpose of using sparse vectors in the first place. As a last resort I could iterate through each element of the vector and compute the norm manually, but since efficiency is really important I was looking for something faster and easier to implement.

Thanks in advance for any help!

EDIT: I tried everything that was suggested and the best solution is:

sqrt((vector1.data ** 2).sum())

from Dougal. But the Cython solution is also very good, and it performs better as the number of non-zero elements in the vector grows. Thanks everyone for the help!
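Put together, a minimal sketch of that accepted approach (building the vector as LIL to avoid the assignment-efficiency warnings, then converting to CSR; the square root is applied to get the actual norm):

```python
import numpy as np
from scipy import sparse

# Build incrementally as LIL (efficient for assignment), then convert to CSR
vector1 = sparse.lil_matrix((1, 4000000), dtype=np.float64)
vector1[0, 10] = 5
vector1[0, 1500] = 80
vector1[0, 2000000] = 6
vector1 = vector1.tocsr()

# 2-norm computed from the stored nonzeros only
n = np.sqrt((vector1.data ** 2).sum())
print(n)  # sqrt(5**2 + 80**2 + 6**2) ≈ 80.38
```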

Willian Fuks

3 Answers

  1. I hope you are not really initializing and setting elements like that; those warnings are raised for a reason, and a 4M-element temporary list proves you have plenty of resources left ;).
  2. Calculating the norm by hand is very simple: just use the underlying data array, vector1.data, directly. You can also use things like vector1.multiply(vector1) followed by .sum(), or vector1.dot(vector1.T), but as Dougal pointed out, those can be much slower for this simple case.
  3. I guess you want to do more, but if all you need are vector norms, going through sparse matrices seems like a lot of unnecessary work.
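The approaches above can be compared directly; a minimal sketch (using a small CSR row vector `v` as a stand-in for the 4M-element one) showing that all three give the same squared norm:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the large sparse row vector
v = sparse.csr_matrix(np.array([[0.0, 3.0, 0.0, 4.0]]))

# 1. Work on the underlying array of stored nonzeros directly
sq1 = (v.data ** 2).sum()

# 2. Elementwise multiply, then sum
sq2 = v.multiply(v).sum()

# 3. Dot product with the transpose (a 1x1 result)
sq3 = v.dot(v.T)[0, 0]

print(sq1, sq2, sq3)  # all 25.0; the norm is the square root, 5.0
```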
seberg
  • There doesn't seem to be a way to directly do elementwise exponentiation on sparse matrices, but `(vector1.data ** 2).sum()` is probably the most direct way to do the norm by hand. This works on at least CSR and CSC matrices; it's possible that on other types of matrices it could break or just do the wrong thing, though, I'm not sure. – Danica Sep 24 '12 at 23:04
  • True, but using the dot product or multiply+sum is just as good. Though those silently convert the sparse matrix in many cases. – seberg Sep 24 '12 at 23:07
  • The only reason I'd think that dotting or multiplying might be a little slower is that it then has to consider the indices, because it doesn't magically know that the elements are "lined up" perfectly. – Danica Sep 24 '12 at 23:17
  • It's actually a pretty huge difference: on a 4M elt vector with 150 random non-zeros, I got `6.56 ms` for `m.T.dot(m)`, `36.7 ms` for `m.multiply(m).sum()`, and `10.6 us` for `(m.data**2).sum()`. ([IPython session that tested it](https://gist.github.com/3779075)) – Danica Sep 24 '12 at 23:26

I just had the same problem here; I implemented a function in Cython to increase the speed of this simple operation. I tested it with a 4M sparse vector of doubles with 100k non-zero elements. The method using sqrt(vector.multiply(vector).sum()) took 874 µs and my function 205 µs.

# sparseLib.pyx
#cython: boundscheck=False
from libc.math cimport sqrt

cpdef double sparseNorm2(double [:] data) nogil:
  cdef long i
  cdef double value = 0.0
  # sum of squares over the stored nonzeros, then square root
  for i in range(data.shape[0]):
    value += data[i]*data[i]
  return sqrt(value)
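For comparison and as a correctness check, the same quantity in pure NumPy (not from the answer; just an equivalent reference implementation the compiled function should match):

```python
import numpy as np

def dense_norm2(data):
    # Same math as sparseNorm2: sqrt of the sum of squares
    return np.sqrt(np.dot(data, data))

data = np.array([5.0, 80.0, 6.0])
print(dense_norm2(data))  # ≈ 80.38
```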
flaviotruzzi

I don't think your initialization is doing what you think it is.

For norm to work, you need to have a square array. If you are trying to make a square array with 4 million elements, you want to do

csr_matrix((2000, 2000), dtype=float64)

full documentation for initialization at scipy
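For what it's worth, reshaping does preserve the value: the Frobenius norm of the 2000x2000 view equals the 2-norm of the flat 4M-element vector (a sketch with dense arrays, using the same nonzeros as the question):

```python
import numpy as np

flat = np.zeros(4000000)
flat[10], flat[1500], flat[2000000] = 5, 80, 6

# Frobenius norm of the square view == 2-norm of the flattened vector
square = flat.reshape(2000, 2000)
print(np.linalg.norm(square), np.linalg.norm(flat))
```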

BostonJohn
  • The problem is that I need to compute the norm of vectors, not matrices. Each element in the vector represents how much a user likes or dislikes some product, which is why I need this structure. In short, the norm is the square root of the sum of each squared element in the vector. – Willian Fuks Sep 24 '12 at 18:13
  • numpy.linalg.norm does a matrix norm, not a vector norm. Is there a reason to not just do `sqrt(sum(i**2 for i in data))`? – BostonJohn Sep 24 '12 at 18:40