
The following code runs in 45s when using pure Python.

for iteration in range(maxiter):
    for node in range(n):
        for dest in adjacency_list[node]:
            rs[iteration + 1][dest] += beta * rs[iteration][node] / len(adjacency_list[node])

But by simply initializing rs as a NumPy ndarray instead of a Python list of lists, the code runs in 145s. I do not really know why NumPy takes about three times as long with this kind of element-by-element indexing.

My idea was to vectorize as much as possible, but I have only managed to vectorize the multiplication by beta / len(adjacency_list[node]). This code runs in 77s.

beta_over_out_degree = np.array([beta / len(al) for al in adjacency_list])
for iteration in range(1, maxiter + 1):
    r_next = np.full(shape=n, fill_value=(1 - beta) / n)
    f = beta_over_out_degree * r
    for i in range(n):
        r_next[adjacency_list[i]] += f[i]

    r = np.copy(r_next)
    rs[iteration] = np.copy(r)

The problem is that adjacency_list is a list of lists with rows of differing lengths: 100 000 rows with 1-15 entries each. A more standard approach with an adjacency matrix, at least as a dense ndarray, is not an option, since for n = 100 000 its (n, n) shape is far too big to allocate in memory (roughly 100 000² × 8 bytes ≈ 80 GB for float64).

Is there any way to vectorize the update using these indices with NumPy advanced indexing (maybe by turning adjacency_list into a NumPy ndarray)?

I would also greatly appreciate any other speed tips. Thanks in advance!
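
For concreteness, here is a minimal sketch of the kind of advanced-indexing vectorization I have in mind (it reuses r, rs, n and beta_over_out_degree from the snippet above, builds the index arrays once, and uses np.add.at because a plain fancy-indexed += would count each repeated destination index only once):

import numpy as np

# Flatten adjacency_list once into parallel source/destination index arrays
src = np.concatenate([np.full(len(dests), node) for node, dests in enumerate(adjacency_list)])
dst = np.concatenate([np.asarray(dests) for dests in adjacency_list])

for iteration in range(1, maxiter + 1):
    r_next = np.full(n, (1 - beta) / n)
    f = beta_over_out_degree * r
    np.add.at(r_next, dst, f[src])  # unbuffered add: repeated dst entries all accumulate
    r = r_next
    rs[iteration] = r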

EDIT: Thanks to @stevemo I managed to build adjacency_matrix as a csr_matrix and used it for the iterative multiplication. The program now runs in only 2s!

for iteration in range(1, 101):
    rs[iteration] += rs[iteration - 1] * adjacency_matrix
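
For reference, a minimal sketch of how such a sparse transition matrix can be built from adjacency_list with scipy.sparse.csr_matrix (the weight beta / out_degree(node) on each edge node -> dest mirrors the original loop; variable names follow the snippets above):

import numpy as np
from scipy.sparse import csr_matrix

# One (row, col, weight) triplet per edge node -> dest, weight = beta / out-degree
rows, cols, data = [], [], []
for node, dests in enumerate(adjacency_list):
    w = beta / len(dests)
    for dest in dests:
        rows.append(node)
        cols.append(dest)
        data.append(w)

# n x n sparse matrix; only the ~15*n nonzero entries are stored
adjacency_matrix = csr_matrix((data, (rows, cols)), shape=(n, n))

With this layout, rs[iteration - 1] * adjacency_matrix in the loop above is a row-vector-times-matrix product, so each destination accumulates beta * r[node] / out_degree(node) from its in-neighbors, just like the original triple loop.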
sukisule
  • Can you share the dimensions of the arrays? Perhaps create a dummy example? – yatu May 20 '20 at 19:42
  • Do you need the values at all the iterations, or just the final one? – Igor Rivin May 20 '20 at 19:49
  • If you mean the values of adjacency_list, since it is used in the innermost loop, I need them at all iterations. – sukisule May 20 '20 at 19:57
  • with n=100000, shapes are: r -> (1, n), rs -> (100, n), adjacency_list -> (n, between 1 and 15) – sukisule May 20 '20 at 20:00
  • Indexing individual elements of an array is slower. – hpaulj May 20 '20 at 20:14
  • With a question like this we need both a small demonstration case (to test alternatives against), and estimates of the real world problem size. The fact that you have lists of varying size makes it much harder to find a fast `numpy` solution (which is optimized for 'rectangular' multidimensional arrays). – hpaulj May 20 '20 at 20:30
  • To be honest I don't know a simple way to give you a demonstration without giving a full dataset/text file and full script. – sukisule May 20 '20 at 20:37

1 Answer


If I understand you correctly, this can be done with a one-liner formula using matrix powers of the adjacency matrix.

Based on your original code snippet, it seems you have a network of n nodes, with the adjacency information stored as a list of lists in adjacency_list, and a value r associated with each node such that its value at iteration k+1 is beta times the sum of the r values of its neighbors at iteration k. (Your loop constructs this in the opposite direction, but it amounts to the same thing.)

If you don't mind reforming your adjacency list-of-lists into a more standard adjacency matrix, such that A_ij = 1 if i and j are neighbors and 0 otherwise, then you could accomplish the inner two loops with a simple matrix product, r[k+1] = beta * (A @ r[k]).

And following this logic, r[k+2] = beta * (A @ (beta * (A @ r[k]))) = (beta * A)**2 @ r[k] or in general,

r[k] = (beta * A)**k @ r[0]

Let's try this on a small network:

import numpy as np

# adjacency matrix
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0]
])

# initial values
n = 5
beta = 0.5
r0 = np.ones(n)
maxiter = 10

# after one iteration
print(beta * (A @ r0))
# [1.  1.  1.5 1.  0.5]

# after 10 iterations
print(np.linalg.matrix_power((beta * A), maxiter) @ r0)
# [2.88574219 2.88574219 3.4921875  1.99414062 0.89257812]
stevemo
  • Thank you for a very detailed and helpful answer. I am a noob at stackoverflow, so I forgot yet another detail: adjacency matrix, at least as a normal ndarray, is not an option, since for n=100 000 its shape of (n,n) is too big to be allocated to memory. – sukisule May 21 '20 at 07:33
  • Scipy’s [sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html) should have no problem with this, and play well with NumPy. They can seem a bit intimidating for a new user, but I assure you it’s worth the short learning curve :) – stevemo May 21 '20 at 12:20
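
As a minimal sketch of the sparse approach suggested in that last comment, here is the answer's 5-node example redone with csr_matrix (repeated sparse-matrix / dense-vector products instead of a dense matrix power; A, beta, r0 and maxiter are assumed from the answer's snippet above):

from scipy.sparse import csr_matrix

A_sparse = csr_matrix(A)  # same adjacency as above, stored sparsely

r = r0.copy()
for _ in range(maxiter):
    r = beta * (A_sparse @ r)  # sparse @ dense 1-D array returns a dense 1-D array
print(r)
# should match the dense matrix_power result above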