5

I have a sparse matrix A (shape 10 * 3 when dense), such as:

print type(A)
<class scipy.sparse.csr.csr_matrix>

print A
(0, 0)  0.0160478743808
(0, 2)  0.0317314165078
(1, 2)  0.0156596521648
(1, 0)  0.0575683686558
(2, 2)  0.0107481166871
(3, 0)  0.0150580924929
(3, 2)  0.0297743235876
(4, 0)  0.0161931803955
(4, 2)  0.0320187296788
(5, 2)  0.0106034409766
(5, 0)  0.0128109177074
(6, 2)  0.0105766993238
(6, 0)  0.0127786088452
(7, 2)  0.00926522256063
(7, 0)  0.0111941023699

The max value of each column is:

print A.max(axis=0)
(0, 0)  0.0575683686558
(0, 2)  0.0320187296788

I would like to get the row index of the max value in each column. I know that

A.getcol(i).tolist()
gives me each column as a list, which lets me use the argmax() function, but this approach is really slow. Is there a decent way to do this?
KEXIN WANG
  • Is your matrix able to fit in memory? Doing `A.todense().argmax(axis=0)` would do what you want as long as the `A.todense()` is possible. – kbrose Jul 11 '16 at 15:11
  • `argmax` would be a nice enhancement to the scipy sparse matrices. In the meantime: Can you switch to CSC format? If so, there is a way to get the argmax of the columns fairly efficiently. – Warren Weckesser Jul 11 '16 at 15:17
  • @kbrose, .todense() is not possible since the data doesn't fit in memory. – KEXIN WANG Jul 12 '16 at 15:22

2 Answers

3

A more efficient way to get the max and argmax of each matrix column is simply to use the native scipy.sparse methods:

  • max value of A in each column:

    max_values = A.max(axis=0)

  • argmax of A in each column:

    max_args = A.argmax(axis=0)

The same methods compute the max and argmax of each row (using axis=1) or of the whole matrix (using axis=None).
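As a minimal sketch of the two calls above, the following rebuilds the matrix printed in the question and applies them. It assumes a SciPy version whose sparse matrices provide `argmax`; the names `rows`, `cols`, and `vals` are introduced here only for illustration.

```python
import numpy as np
from scipy import sparse

# Rebuild the matrix from the question: shape (10, 3), with column 1 and
# rows 8-9 holding no stored entries.
rows = [0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
cols = [0, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0]
vals = [0.0160478743808, 0.0317314165078, 0.0156596521648, 0.0575683686558,
        0.0107481166871, 0.0150580924929, 0.0297743235876, 0.0161931803955,
        0.0320187296788, 0.0106034409766, 0.0128109177074, 0.0105766993238,
        0.0127786088452, 0.00926522256063, 0.0111941023699]
A = sparse.csr_matrix((vals, (rows, cols)), shape=(10, 3))

# A.max(axis=0) returns a sparse 1 x 3 result; flatten it to a 1-D array.
max_values = A.max(axis=0).toarray().ravel()

# A.argmax(axis=0) returns a 1 x 3 result; flatten it to a 1-D array.
max_args = np.asarray(A.argmax(axis=0)).ravel()

print(max_values)
print(max_args)
```

Flattening with `ravel()` is only a convenience so that the results can be indexed like ordinary 1-D arrays; neither call converts the whole matrix to dense form.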

Federico Caccia
1

This is a slight variation of the method you suggested in the question:

col_argmax = [A.getcol(i).A.argmax() for i in range(A.shape[1])]

(The .A attribute is equivalent to .toarray().)

A potentially more efficient alternative is

B = A.tocsc()
col_argmax = [B.indices[B.indptr[i] + B.data[B.indptr[i]:B.indptr[i+1]].argmax()] for i in range(len(B.indptr)-1)]

Either of the above will work, but I have to ask: if your array has shape (10, 3), why are you using a sparse matrix? (10, 3) is small! Just use a regular, dense numpy array.

Even if you keep A as a sparse matrix, the most efficient way to compute the argmax of the columns of a matrix that size is probably to just convert it to a dense array and use the argmax method:

col_argmax = A.A.argmax(axis=0)
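For matrices too large to convert to dense form, the CSC-based variant earlier in this answer can be wrapped in a small helper. This is only a sketch: the function name `csc_col_argmax` and the `empty` sentinel are hypothetical, and the sentinel is one possible way to handle columns with no stored entries, for which `argmax` on an empty slice would otherwise raise an error. Note that only stored values are examined, so a column whose stored values are all negative reports one of those entries rather than an implicit zero.

```python
import numpy as np
from scipy import sparse


def csc_col_argmax(A, empty=-1):
    """Return the row index of the largest stored value in each column.

    Columns with no stored entries get the `empty` sentinel (a hypothetical
    convention chosen for this sketch).
    """
    B = A.tocsc()
    result = np.full(B.shape[1], empty, dtype=np.int64)
    for j in range(B.shape[1]):
        start, end = B.indptr[j], B.indptr[j + 1]
        if end > start:  # the column has at least one stored entry
            # Position of the max within this column's slice of `data`,
            # mapped back to a row index through `indices`.
            result[j] = B.indices[start + B.data[start:end].argmax()]
    return result


# Small illustration: column 1 has no stored entries.
example = sparse.csr_matrix(np.array([[0.0, 0.0, 3.0],
                                      [5.0, 0.0, 1.0],
                                      [2.0, 0.0, 0.0]]))
print(csc_col_argmax(example))
```

The loop runs once per column and touches only the stored entries, so the cost scales with the number of nonzeros rather than with the dense size of the matrix.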
Warren Weckesser
  • Hi Warren, thanks a lot for your answer! I tested your solution and it is faster than using A.toarray() or .todense(). – KEXIN WANG Jul 12 '16 at 10:06
  • The only problem is that your method doesn't work when one or more columns of the sparse matrix are empty (all zeros). In that case returning an arbitrary number is fine for me, so I changed your code a little bit to:

        def get_max(i):
            try:
                index = B.data[B.indptr[i]:B.indptr[i+1]].argmax()
            except ValueError:
                # the column has no stored entries (its sum is zero);
                # in other words, this test doc has no word that appears in the train docs
                index = -1
            return index

        maxval_index = [B.indices[B.indptr[i] + get_max(i)] for i in range(len(B.indptr)-1)]

    – KEXIN WANG Jul 12 '16 at 10:15
  • As for your question about why I chose a sparse matrix: the real size of my A matrix is 100k * 300k, and I would like to calculate the inner product of A with another large matrix. The CSR .dot function is pretty fast, which is why I chose a sparse matrix. – KEXIN WANG Jul 12 '16 at 10:15
  • *"...the real size of my A matrix is 100k * 300k..."* OK, that's a good reason for using a sparse matrix. You should include that information in the question. – Warren Weckesser Jul 12 '16 at 14:21