
I want to compute the transitive closure of a sparse matrix in Python. Currently I am using scipy sparse matrices.

The matrix power (**12 in my case) works well on very sparse matrices, no matter how large they are, but for directed not-so-sparse cases I would like to use a smarter algorithm.
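To illustrate what I do now, here is a small sketch of the matrix-power approach kept entirely sparse (the repeated-squaring loop and the int64 dtype are my own choices for this toy example, which matches the reflexive relation shown further down):

```python
import numpy as np
import scipy.sparse as sparse

# The reflexive relation from the examples below, stored sparsely.
A = sparse.csr_matrix(np.array([[1, 1, 0, 1],
                                [0, 1, 1, 0],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.int64))

# Because A is reflexive, A**k contains all paths of length <= k,
# so squaring ceil(log2(n)) times reaches the transitive closure.
closure = A
for _ in range(int(np.ceil(np.log2(A.shape[0])))):
  closure = closure @ closure
  closure.data[:] = 1   # clamp entries back to 0/1 so values don't blow up

print(closure.toarray())
```

The matrix stays sparse throughout; only the clamp step is needed to keep the stored values boolean.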

I have found the Floyd-Warshall algorithm (the German Wikipedia page has better pseudocode) in scipy.sparse.csgraph, which does a bit more than it should: there is no function for Warshall's algorithm alone - that is one issue.

The main problem is that while I can pass a sparse matrix to the function, doing so is utterly pointless: the function always returns a dense matrix, because what should be 0 in the transitive closure is now a path of infinite length, and someone felt this needs to be stored explicitly.
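This is easy to check (a small sketch; the size and density are arbitrary):

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.csgraph import floyd_warshall

# A very sparse directed graph: ~1% of all possible edges.
A = sparse.random(100, 100, density=0.01, format='csr', random_state=0)
dist = floyd_warshall(A, directed=True)

# The output is a dense ndarray; every unreachable pair is stored
# explicitly as inf, so the input's sparsity is lost.
print(type(dist))
print(np.isinf(dist).sum(), "explicit inf entries")
```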

So my question is: Is there any python module that allows computing the transitive closure of a sparse matrix and keeps it sparse?

I am not 100% sure that he works with the same kind of matrices, but Gerald Penn shows impressive speed-ups in his comparison paper, which suggests that the problem can be solved efficiently.


EDIT: As there was some confusion, I will point out the theoretical background:

I am looking for the transitive closure (not reflexive or symmetric).

I will make sure that my relation, encoded in a boolean matrix, already has the required properties, i.e. symmetry or reflexivity.

I have two cases of the relation:

  1. reflexive
  2. reflexive and symmetric


I want to apply the transitive closure to those two relations. This works perfectly well with matrix power (except that in certain cases it is too expensive):

>>> reflexive
matrix([[ True,  True, False,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive**4
matrix([[ True,  True,  True,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive_symmetric
matrix([[ True,  True, False,  True],
        [ True,  True,  True, False],
        [False,  True,  True, False],
        [ True, False, False,  True]])
>>> reflexive_symmetric**4
matrix([[ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True]])

So in the first case, we get all the descendants of a node (including itself), and in the second, we get all the components, that is, for each node, all the nodes that lie in the same component.


Radio Controlled
  • Regarding the PS, I'd like to see some examples (in another question?). Sparse matrix multiplication with a dense array returns a dense array. For efficiency sake, `sparse.csr` does not change the sparsity indexing every time you change a value. Sometimes you have to run `eliminate_zeros` to clean it up. – hpaulj Dec 21 '17 at 17:57
  • Past examples: https://stackoverflow.com/questions/37674435/scipy-sparse-matrix-become-dense-matrix-after-assignment, https://stackoverflow.com/questions/43146968/python-sparse-matrix-remove-duplicate-indices-except-one – hpaulj Dec 21 '17 at 18:12
  • If multiplication is returning a dense matrix, try converting the `other` array into a sparse matrix first. `sparse*sparse` produces `sparse`. – hpaulj Dec 21 '17 at 18:13
  • `floyd_warshall` is in `sparse.csgraph.shortest_path.so`, a compiled module. – hpaulj Dec 21 '17 at 18:23
  • you are right, I just collected this under 'disappointments regarding scipy'... I'll do a new question for that. – Radio Controlled Jan 09 '18 at 14:27

1 Answer


This was brought up on the SciPy issue tracker. The problem is not so much the output format; the Floyd-Warshall implementation begins with a matrix full of infinities and then inserts finite values as paths are found. Sparsity is lost immediately.

The networkx library offers an alternative with its all_pairs_shortest_path_length. Its output is an iterator which returns tuples of the form

(source, dictionary of reachable targets) 

which takes a little work to convert to a SciPy sparse matrix (csr format is natural here). A complete example:

import numpy as np
import networkx as nx
import scipy.stats as stats
import scipy.sparse as sparse

A = sparse.random(6, 6, density=0.2, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
G = nx.DiGraph(A)       # directed because A need not be symmetric; newer networkx may need nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)
paths = nx.all_pairs_shortest_path_length(G)
indices = []
indptr = [0]
for row in paths:
  reachable = [v for v in row[1] if row[1][v] > 0]
  indices.extend(reachable)
  indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = A + sparse.csr_matrix((data, indices, indptr), shape=A.shape)
print(A, "\n\n", A_trans)

The reason for adding A back is as follows. The networkx output includes paths of length 0, which would immediately fill the diagonal. We don't want that to happen (you wanted transitive closure, not reflexive-and-transitive closure); hence the line reachable = [v for v in row[1] if row[1][v] > 0]. But then we don't get any diagonal entries at all, even where A had them (the 0-length empty path beats the 1-length path formed by a self-loop). So I add A back to the result. It now has entries 1 or 2, but only the fact that they are nonzero is significant.

An example of running the above (I pick 6 by 6 size for readability of the output). Original matrix:

  (0, 3)    1
  (3, 2)    1
  (4, 3)    1
  (5, 1)    1
  (5, 3)    1
  (5, 4)    1
  (5, 5)    1 

Transitive closure:

  (0, 2)    1
  (0, 3)    2
  (3, 2)    2
  (4, 2)    1
  (4, 3)    2
  (5, 1)    2
  (5, 2)    1
  (5, 3)    2
  (5, 4)    2
  (5, 5)    1

You can see that this worked correctly: the added entries are (0, 2), (4, 2), and (5, 2), all acquired via the edge (3, 2).


By the way, networkx also has a floyd_warshall method, but its documentation says

This algorithm is most appropriate for dense graphs. The running time is O(n^3), and running space is O(n^2) where n is the number of nodes in G.

The output is dense again. I get the impression that this algorithm is simply considered dense by nature. By contrast, all_pairs_shortest_path_length appears to do a breadth-first search from every node.

Transitive and Reflexive

If instead of transitive closure (which is the smallest transitive relation containing the given one) you wanted transitive and reflexive closure (the smallest transitive and reflexive relation containing the given one), the code simplifies, as we no longer worry about 0-length paths.

indices = []
indptr = [0]
for row in paths:
  indices.extend(row[1])
  indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = sparse.csr_matrix((data, indices, indptr), shape=A.shape)
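If networkx should be avoided altogether, the same reflexive-and-transitive closure can also be built with SciPy alone, via one breadth-first search per node (a sketch; csgraph.breadth_first_order returns every node reachable from the start node, including the start node itself, which is exactly what we need here):

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.csgraph import breadth_first_order

A = sparse.random(6, 6, density=0.2, format='csr', random_state=1).astype(bool).astype(np.uint8)

indices = []
indptr = [0]
for i in range(A.shape[0]):
  # All nodes reachable from i; i itself is included (0-length path),
  # so this row of the result is reflexive by construction.
  reachable = breadth_first_order(A, i, directed=True, return_predecessors=False)
  indices.extend(sorted(reachable))
  indptr.append(len(indices))
data = np.ones(len(indices), dtype=np.uint8)
A_trans = sparse.csr_matrix((data, indices, indptr), shape=A.shape)
```

For very large graphs the per-node Python loop may dominate, but the result stays sparse throughout.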

Transitive, Reflexive, and Symmetric

This means finding the smallest equivalence relation containing the given one. Equivalently, it means dividing the vertices into connected components. For this you don't need networkx; SciPy's sparse.csgraph has a connected_components method. Set directed=False there. Example:

import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import scipy.sparse.csgraph   # make sure the csgraph subpackage is loaded
import itertools

A = sparse.random(20, 20, density=0.02, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
components = sparse.csgraph.connected_components(A, directed=False)
nonzeros = []
for k in range(components[0]):
  idx = np.where(components[1] == k)[0]
  nonzeros.extend(itertools.product(idx, idx))
row, col = zip(*nonzeros)   # collect the pairs once, outside the loop
data = np.ones(len(row), dtype=np.uint8)
B = sparse.coo_matrix((data, (row, col)), shape=A.shape)

This is what the output print(B.toarray()) looks like for a random example, 20 by 20:

[[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]
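A side note on the output of connected_components: the labels array already answers the question "are nodes i and j in the same component" directly, without building any matrix (a small assumed example):

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.csgraph import connected_components

# Two components: {0, 1} joined by an edge, and the isolated node {2}.
A = sparse.csr_matrix(np.array([[0, 1, 0],
                                [0, 0, 0],
                                [0, 0, 0]], dtype=np.uint8))
n_components, labels = connected_components(A, directed=False)

def same_component(i, j):
  # Two nodes are in the same component iff their labels match.
  return labels[i] == labels[j]

print(n_components)          # 2
print(same_component(0, 1))  # True
print(same_component(0, 2))  # False
```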
  • I am somewhat worried about the overhead of creating a python dictionary as output but maybe there really is no C-based implementation that returns a sparse scipy matrix... – Radio Controlled Jan 09 '18 at 15:06
  • Perhaps I misunderstood the meaning of transitive closure. I actually want what you call "reflexive-and-transitive closure". – Radio Controlled Jan 09 '18 at 15:08
  • [node_connected_component(G,n)](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.node_connected_component.html#networkx.algorithms.components.node_connected_component) would also be an option, but I would love to have a solution that would work with the usual dependencies numpy or scipy – Radio Controlled Jan 09 '18 at 15:56
  • Connected components also performs symmetric closure, as the relation "a and b are in the same component" is symmetric. Please make it clear, to yourself and to others, whether you seek transitive closure, transitive-and-reflexive closure, transitive-and-reflexive-and-symmetric closure (so, an equivalence relation generated by the given one), or something else. –  Jan 09 '18 at 16:13
  • For connected components you don't need networkx, [SciPy sparse has them](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html#scipy.sparse.csgraph.connected_components) –  Jan 09 '18 at 16:32
  • Returns: n_components: int The number of connected components. labels: ndarray The length-N array of labels of the connected components. I could not understand what this output is useful for. I need to know if two nodes are in the same component. – Radio Controlled Jan 09 '18 at 16:46
  • This sounds like there are say 5 components, the labels for example 1,2,3,4,5. Ok, great... – Radio Controlled Jan 09 '18 at 16:48
  • Added a complete example for equivalence relation using connected_components. –  Jan 09 '18 at 22:26
  • This is great! I think you have to change the indentation in your example. Also, you can use zip to get rows and cols. – Radio Controlled Jan 11 '18 at 11:56
  • Now I realize at one point I also need to get for every node the set of nodes reachable (so if the input is directed, the ancestors are not reachable). Unfortunately the `directed=True` in `connected_components()` means something else. Do you have a suggestion? – Radio Controlled Jan 11 '18 at 11:59
  • as the dijkstra output is dense, I tried to use the `indices` argument to calculate the paths only for a few lines at a time, then compress to csr and append to the final solution, but it seems that the dijkstra function just does not consider the sparsity of the input as the memory consumption explodes.... – Radio Controlled Jan 11 '18 at 15:46
  • I already said that the concept of connected components is inherently symmetric: if a is in the same component as b, then of course b is in the same component as a. So if the relation should be asymmetric, components are the wrong tool. Anyway, I have given three answers and you still don't know what you actually want. I'm out. –  Jan 11 '18 at 20:57
  • I am very grateful for all the effort you put into this; explaining the components functionality and your code helped a lot. The asymmetric problem is an additional task. Here, I am not talking about components anymore and I do not want to use networkx. Probably the answer is that this is not possible. – Radio Controlled Jan 15 '18 at 12:16