
I am writing code in Python to compute the 2D discrete Laplacian as a sparse matrix. The code is as follows:

import numpy as np
from scipy.sparse import csc_matrix

def discretizeLaplacian(N, step):
    NN = N**2
    dxx = 1.0 / step**2
    i_loc, j_loc, vals = np.array([]), np.array([]), np.array([])
    for i in range(1, N-1):
        for j in range(1, N-1):
            # 5-point stencil indices for the interior grid point (i, j)
            iv0 = j*N + i
            iv1 = (j - 1)*N + i
            iv2 = (j + 1)*N + i
            iv3 = j*N + i - 1
            iv4 = j*N + i + 1
            i_loc = np.concatenate((i_loc, [iv0, iv0, iv0, iv0, iv0]))
            j_loc = np.concatenate((j_loc, [iv0, iv1, iv2, iv3, iv4]))
            vals = np.concatenate((vals, dxx*np.array([-4, 1, 1, 1, 1])))
    sparseMatrix = csc_matrix((vals, (i_loc, j_loc)), shape=(NN, NN))
    return sparseMatrix

The for loop takes much longer to run than the equivalent Matlab code. I am wondering if I can write it as a nested list comprehension to make it faster.

Sean
  • A list comprehension is just syntactic sugar for a normal loop. This shouldn't improve efficiency. – mozway Dec 28 '21 at 06:56
  • Well, there are other posts saying that list comprehension in Python is much faster than normal loop as the way it accesses the memory. – Sean Dec 28 '21 at 06:57
  • @mozway is correct. It will not run faster using a list comprehension. At one time that was the case, but newer versions of Python optimize your code, and regular loops will run the same as one written with a list comprehension. – Mukherjee Dec 28 '21 at 06:57
  • Thanks! Is there any way to make this code run faster? – Sean Dec 28 '21 at 07:01
  • Would https://stackoverflow.com/a/66080877/51685 work for you? – AKX Dec 28 '21 at 07:07
  • Without access to the corresponding Matlab code it's really hard to speculate what it's doing differently, but the typical answer is to use vectorized operations _instead of_ loops. – tripleee Dec 28 '21 at 07:48
  • The Matlab code is exactly the same as I wrote above. – Sean Dec 28 '21 at 07:53
  • @mozway It *theoretically* shouldn't, but it *actually does* improve efficiency. See [this code](https://godbolt.org/z/5nofrrPTs) on Godbolt. The list comprehension generates significantly more efficient bytecode. In practice, the list-comprehension version is more than 2 times faster than the plain loop on my machine with the latest version of CPython. – Jérôme Richard Dec 29 '21 at 13:53
  • @JérômeRichard A good chunk of that is repeated lookups of `lst.append`, which can be easily refactored by setting `f = lst.append` before the loop. The rest of the speed-up is due to using the `LIST_APPEND` bytecode instead of having to actually call a list method. So in some extremely narrow situations, the list comprehension can be faster, but real code usually has more complex cases that wash out that speed-up. – chepner Dec 29 '21 at 14:35
  • I guess `j*N + I` is a typing mistake. Shouldn't it be `j*N + i`? – Jérôme Richard Dec 29 '21 at 16:19
  • @chepner Indeed, but `LIST_APPEND` alone makes the code 50% faster in this case. It is sad to have to manually write `f = lst.append` (less readable IMHO) to get good performance. I don't fully agree with the "*extremely* narrow situations", since list comprehensions are generally used in simple cases (where the operation tends to be inexpensive). The OP's use case is not so simple, though. – Jérôme Richard Dec 29 '21 at 16:51
  • It's also a trivial sample that's 4-10x slower than `list(range(100))`. – chepner Dec 29 '21 at 16:58
  • @JérômeRichard Yes, it was a mistake. – Sean Dec 30 '21 at 02:17
  • @chepner How can I use LIST_APPEND? – Sean Dec 30 '21 at 02:18
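
A minimal sketch of what the last few comments describe, using CPython's standard dis module (the function names are illustrative and the exact opcode names differ between CPython versions):

import dis

def with_loop(n):
    lst = []
    for i in range(n):
        lst.append(i * 2)    # the loop body looks up and calls lst.append on every iteration
    return lst

def with_hoisted_append(n):
    lst = []
    append = lst.append      # look up the bound method once, before the loop
    for i in range(n):
        append(i * 2)
    return lst

def with_comprehension(n):
    return [i * 2 for i in range(n)]   # CPython emits LIST_APPEND here instead of a method call

dis.dis(with_loop)
dis.dis(with_comprehension)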

1 Answer


The slow performance comes from the poor algorithmic complexity: the original code runs in O(N**4) time. This is due to np.concatenate, which creates a new array by copying the old one and appending a few items at the end. The loop body runs about N**2 times and each copy can touch up to O(N**2) elements, so O(N**2) copies of 3 growing arrays add up to O(N**4) work. In general, you should avoid np.concatenate in a loop to build a growing array; use Python lists in that case and convert them to arrays once at the end.
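
To illustrate the difference, here is a minimal sketch (not the original function; grow_with_concatenate and grow_with_list are illustrative names) contrasting the two ways of building a growing array:

import numpy as np

def grow_with_concatenate(n):
    # Every np.concatenate allocates a new array and copies everything built so far,
    # so building n chunks this way costs O(n**2) copies in total.
    arr = np.array([])
    for k in range(n):
        arr = np.concatenate((arr, [k, k, k, k, k]))
    return arr

def grow_with_list(n):
    # list.extend is amortized O(1) per item; the conversion to a NumPy array
    # happens once at the end, so the total cost is O(n).
    lst = []
    for k in range(n):
        lst.extend([k, k, k, k, k])
    return np.array(lst)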

Note that you can pre-compute the constant dxx * np.array([-4, 1, 1, 1, 1]) outside the loop and use np.tile to repeat it as many times as needed.

Here is the corrected code:

import numpy as np
from scipy.sparse import csc_matrix

def discretizeLaplacian(N, step):
    NN = N**2
    dxx = 1.0 / step**2
    # Pre-compute the 5-point stencil values once
    tmp = dxx * np.array([-4, 1, 1, 1, 1])
    i_loc, j_loc = [], []
    for i in range(1, N-1):
        for j in range(1, N-1):
            iv0 = j*N + i
            iv1 = (j - 1)*N + i
            iv2 = (j + 1)*N + i
            iv3 = j*N + i - 1
            iv4 = j*N + i + 1
            # Grow plain Python lists: amortized O(1) per appended item
            i_loc.extend([iv0, iv0, iv0, iv0, iv0])
            j_loc.extend([iv0, iv1, iv2, iv3, iv4])
    i_loc = np.array(i_loc, dtype=np.int64)
    j_loc = np.array(j_loc, dtype=np.int64)
    vals = np.tile(tmp, (N-2)**2)
    sparseMatrix = csc_matrix((vals, (i_loc, j_loc)), shape=(NN, NN))
    return sparseMatrix
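
For example, a quick sanity check of the corrected function (the grid size and step below are arbitrary illustrative values):

L = discretizeLaplacian(10, 1.0)    # 10x10 grid with unit spacing
print(L.shape)                      # (100, 100)
print(abs(L.sum(axis=1)).max())     # 0.0: each stencil row sums to -4 + 1 + 1 + 1 + 1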

Here are performance results on my machine with N=200:

Original code:   22.510 s
Corrected code:   0.068 s

While you can use a list comprehension to build i_loc, for example [j*N + i for i in range(1, N-1) for j in range(1, N-1) for _ in range(5)] (each diagonal index repeated five times), it is not as easy for j_loc because five different items are appended per grid point. You can build a list of lists/tuples and then flatten it, but that makes the code less readable and no faster.
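
For completeness, a comprehension-based construction could look like the following sketch (build_indices_with_comprehensions is an illustrative name; j_loc is flattened with itertools.chain.from_iterable, and this is not faster than the extend-based loop above):

from itertools import chain
import numpy as np

def build_indices_with_comprehensions(N):
    # i_loc: each diagonal index j*N + i repeated 5 times
    i_loc = [j*N + i
             for i in range(1, N-1) for j in range(1, N-1)
             for _ in range(5)]
    # j_loc: one 5-tuple of column indices per interior grid point, flattened
    j_loc = list(chain.from_iterable(
        (j*N + i, (j - 1)*N + i, (j + 1)*N + i, j*N + i - 1, j*N + i + 1)
        for i in range(1, N-1) for j in range(1, N-1)))
    return np.array(i_loc), np.array(j_loc)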

If this is not fast enough, you can use NumPy vectorized functions (such as np.meshgrid, np.arange, and basic array arithmetic) to build i_loc and j_loc without any Python loop. Please consider reading the NumPy tutorials about that. Python loops are generally very slow because Python code is interpreted, whereas Matlab internally uses a just-in-time (JIT) compiler to execute such loop-based code quickly. You can use Numba or Cython to do the same thing in Python.
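
As a rough sketch of that fully vectorized approach (same interior-point stencil as the loop version; discretizeLaplacianVectorized is an illustrative name and has not been benchmarked here):

import numpy as np
from scipy.sparse import csc_matrix

def discretizeLaplacianVectorized(N, step):
    NN = N**2
    dxx = 1.0 / step**2
    # Interior grid points only, matching the loop ranges above
    i, j = np.meshgrid(np.arange(1, N-1), np.arange(1, N-1), indexing='ij')
    iv0 = (j*N + i).ravel()            # diagonal (row) index of each interior point
    rows = np.repeat(iv0, 5)           # each row index appears 5 times
    # Columns: the point itself and its four neighbours (j-1, j+1, i-1, i+1)
    cols = np.stack([iv0, iv0 - N, iv0 + N, iv0 - 1, iv0 + 1], axis=1).ravel()
    vals = np.tile(dxx * np.array([-4.0, 1.0, 1.0, 1.0, 1.0]), iv0.size)
    return csc_matrix((vals, (rows, cols)), shape=(NN, NN))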

Jérôme Richard