
I am trying to put together a numerical simulation of beta cell dynamics based on Bertram et al. 2007 (https://www.sciencedirect.com/science/article/pii/S0006349507709621). The model itself works fine, but it is very slow: the simulation step must be around 0.1 ms and Python is not the fastest language around. It takes approximately 12 real seconds for every simulated second with only 15 coupled beta cells in the system. In the end I will need around 1000 beta cells to simulate an entire islet of Langerhans, so you can see why I need to speed things up.

Each beta cell is implemented as a class instance which inherits from the CellParameters and ModelParameters classes.

@jitclass(spec)
class BetaCell:
    def __init__(self, cell_num: int, neighbours: list, G: float):
        ### sets initial conditions (23 parameters - floats and lists)

    def w_ijkl(self, ii, jj, kk, ll, f6p):
        ### calculates and returns a specific parameter

    def run_model_step(self, Ge: float):
        ### runs one time step (dt=0.1 ms) for the cell
        ### has to calculate/update around 55 parameters

class ModelParameters:
    ### Contains all model parameters:
    ### the time step, the intensity of glucose stimulation, the start of stimulation etc.
    ### Also contains when to save a time step for later visualization.

    @staticmethod
    def external_glucose(time):
        ### calculates and returns the current level of external glucose
        ### uses a simple equation

class CellParameters:
    ### Contains approx. 70 parameters (floats) that the model needs for execution.
    ### Some of these parameters are changed (once) after initialization
    ### to introduce some cell heterogeneity.

The simulation looks like this:

  1. Some data is imported with cell parameters (locations, coupling, coupling weights).
  2. Each cell is initialized with its cell number (0, 1, 2, 3...), its neighbours and the starting glucose level. Cells are stored in a list named "cells".
  3. If required, heterogeneity is introduced into cellular parameters.
  4. Each step of the simulation is executed.

Simulation step execution:

def run_step(cell):
    cell.run_model_step(glc)

if __name__ == '__main__':
    for step, current_time in enumerate(time):
        ###time array is pre-calculated based on provided end_time and simulation step (dt)
        glc = ModelParameters.external_glucose(current_time)
        cells = calculate_gj_coupling(cells) #calculates gap-junction coupling between connected cells
        cells = list(map(run_step, cells))

The above for-loop is repeated until the end of the simulation is reached. Of course this is a slow process, taking around 10-12 real seconds for each simulated second (10000 loop iterations at 0.1 ms steps).

I really need to speed things up; a 10-fold speedup or more would be great.

So far I have tried the Pool class from the multiprocessing module.

I created a pool with pool = Pool(processes=NUMBER_OF_WORKERS) and used the pool's map function to run a simulation step for each cell:

pool = Pool(processes=NUMBER_OF_WORKERS)
.
.
.
for step, current_time in enumerate(time):
    ###time array is pre-calculated based on provided end_time and simulation step (dt)
    glc = ModelParameters.external_glucose(current_time)
    cells = calculate_gj_coupling(cells) #calculates gap-junction coupling between connected cells
    cells = pool.map(run_step, cells)

pool.terminate()

The rest is the same as before, because the slow part is the calculation of the individual time steps for every beta cell.

The problem with the above solution is that it makes things worse. I am guessing that serializing (pickling) the class instances back and forth for each process is the culprit, because the same solution worked wonders for a simplified version of the problem (below).
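To check that hunch, here is a minimal sketch (illustrative only; it assumes a cell instance is picklable, which plain Python classes are but Numba jitclass instances are not) measuring what pool.map pays in serialization for a single cell:

import pickle
import time

# pool.map pickles every cell to a worker and back on every simulation
# step, so the per-cell serialization cost matters.
cell = cells[0]  # assumes the `cells` list from the setup above
t0 = time.perf_counter()
payload = pickle.dumps(cell)
elapsed = time.perf_counter() - t0
print(f"pickled size: {len(payload)} bytes, time: {elapsed * 1e6:.1f} us")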

import time
from multiprocessing import Pool
from random import randint

# values taken from the timing experiment described below
NUMBER_OF_WORKERS = 60
NUMBER_OF_CELLS = 200
NUMBER_OF_STEPS = 1

def task_function(dummy_object):
    dummy_object.sum_ab()
    return dummy_object

class DummyObject:
    def __init__(self, num, a, b):
        self.num = num
        self.a = a
        self.b = b
        self.ab = 0.0

    def sum_ab(self):
        time.sleep(2)  # simulates a long-running task
        self.ab += self.a + self.b

if __name__ == '__main__':
    pool = Pool(processes=NUMBER_OF_WORKERS)
    cells = [DummyObject(i, randint(1, 20), randint(1, 20)) for i in range(NUMBER_OF_CELLS)]
    for i in range(NUMBER_OF_STEPS):
        cells = pool.map(task_function, cells)

    pool.terminate()

The above simple example speeds things up quite a bit. With sequential execution (the standard way), one iteration of the for-loop takes 400 seconds at NUMBER_OF_CELLS=200 (each cell takes 2 seconds * 200 = 400 s). With the pooled solution above, one iteration takes only 8 seconds at NUMBER_OF_CELLS=200 and NUMBER_OF_WORKERS=60. But these DummyObjects are of course very small and simple, so moving them between processes is quick.

Any ideas to implement some version of the above dummy solution would be greatly appreciated.

EDIT 16. 2. 2023: Thanks to Fanchen Bao I have found the remaining bottleneck in my code. It is the coupling function that calculates the coupling currents between connected cells.

The coupling function looks like this:

@jit(nopython=True)
def calculate_gj_coupling(cells, cells_neighbours):
    for i, cell in enumerate(cells):
        ca_current = 0.0
        voltage_current = 0.0
        g6p_current = 0.0
        adp_current = 0.0
        for neighbour, weight in cells_neighbours[i]:
            voltage_current += (cell.Cgjv*weight)*(cells[neighbour].V-cell.V)
            ca_current += (cell.Cgjca*weight)*(cells[neighbour].C-cell.C)
            g6p_current += (cell.Cgjg6p*weight)*(0.3*cells[neighbour].G6P-0.3*cell.G6P)
            adp_current += (cell.Cgjadp*weight)*(cells[neighbour].ADPm - cell.ADPm)
        cell.couplingV = voltage_current
        cell.couplingCa = ca_current
        cell.couplingG6P = g6p_current
        cell.couplingADP = adp_current
    return cells

It is basically a nested for-loop because each connection between two cells is weighted (weight parameter).

What would be a more pythonic (and faster) way of writing this up? Keep in mind that this function runs in every simulation step.

EDIT 18. 2. 2023: I rewrote the BetaCell class. It now contains all cell parameters (instead of inheriting from the CellParameters class), and all necessary model parameters are provided at initialization (dt, save_step). This allowed me to add the Numba jitclass decorator with the corresponding spec; it threw an error before because there appears to be a problem with inheritance during compilation, I guess. I also use Numba's typed List class instead of the Python built-in list.
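For reference, here is a minimal sketch of the new layout (the spec entries and initial values are illustrative placeholders; the real class carries around 70 parameters):

from numba import float64, int64
from numba.experimental import jitclass
from numba.typed import List

# Illustrative spec -- the real one lists every cell/model parameter.
spec = [
    ('cell_num', int64),
    ('dt', float64),        # model parameters now passed in at __init__
    ('save_step', int64),
    ('V', float64),
    ('C', float64),
    ('G6P', float64),
    ('ADPm', float64),
]

@jitclass(spec)
class BetaCell:
    def __init__(self, cell_num, dt, save_step):
        self.cell_num = cell_num
        self.dt = dt
        self.save_step = save_step
        # placeholder initial conditions
        self.V = -65.0
        self.C = 0.1
        self.G6P = 0.0
        self.ADPm = 0.0

cells = List()  # numba.typed.List instead of a Python built-in list
for i in range(15):
    cells.append(BetaCell(i, 0.1, 100))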

  • Python is clearly not designed for high-performance, numerically intensive applications. It is meant to run glue code or scripts. Such glue code can call efficient vectorized functions; Numpy is a good example of that, and it is certainly a good idea to use it if you do not plan to use another language for this task. Moreover, the standard implementation of Python, CPython, has not been designed with a particular focus on multicore machines. The GIL prevents most multithreaded code from scaling, and multiprocessing tends to be a pain and inefficient in many cases. – Jérôme Richard Feb 15 '23 at 15:58
  • I strongly advise you to use a native language like C/C++, or possibly to try Cython. Using more cores just wastes more resources, since interpreted pure-Python code (i.e. without vectorized calls) is often 100-1000 times slower and does not scale well on many-core CPUs. Besides, note that arrays of structures (AoS) tend not to be efficient anyway, whatever the language; a structure of arrays (SoA) is a more efficient approach. Mainstream OOP approaches typically result in inefficient code. For more information about this, please consider reading more about "SoA vs AoS" and "data-oriented design". – Jérôme Richard Feb 15 '23 at 16:07
  • I don't think we have done a thorough enough investigation into the performance issue to warrant a switch to another language. My suggestion is to use a profiler to pinpoint where the longest-running process is (I have had success with [Pyinstrument](https://pyinstrument.readthedocs.io/en/latest/home.html)). Without that knowledge, any optimization attempt is just a shot in the dark. If you can update your question with the result of the profiler, that would be very helpful. – Fanchen Bao Feb 15 '23 at 18:46
  • @FanchenBao Thank you for the Pyinstrument suggestion. It really helped me. I have more or less given up on trying to use multiprocessing, but using Numba speeds things up very much as well. One iteration of the for-loop with 15 cells now takes only 3 seconds instead of 12 :) Thanks to Pyinstrument I have also pinpointed the remaining bottleneck. It is the coupling function, which is pretty slow even when using Numba. Any ideas how to speed it up? I will post it in the original question. – Marko Šterk Feb 16 '23 at 09:05
  • @JérômeRichard Thank you for your comment. I am starting to agree with you: Python and multiprocessing are a pain in the a**. Do you have any suggestions on how to speed up the coupling function? I have posted it in the original question. – Marko Šterk Feb 16 '23 at 09:07
  • You use `@jit(nopython=True)`, but it looks like you do not use `jitclass`. If so, then Numba cannot compile this code and should report an error or a warning stating that it will fall back to slow pure-Python code. Besides, as pointed out by FanchenBao, it is better to strip down OOP because it causes inefficient strided memory access patterns that prevent any SIMD optimizations (see my comment about SoA vs AoS above). Once fixed, you could try a double-buffering approach so as to run this step in parallel using multiple threads in Numba. – Jérôme Richard Feb 18 '23 at 13:50
  • I have added the ```@jit(nopython=True)``` decorator to the coupling function and ```@jitclass(spec)``` to the BetaCell class (I edited the original post to reflect this). These two combined gave a significant speed increase. The simulation now runs at a ratio of 1:9 (in seconds): 9 real seconds for 1 simulated second at dt=0.1 ms and 150 cells. Originally (uncompiled) the ratio was about 1:158. This is pretty decent for now. I will try to further speed up the simulation by improving the coupling function with the suggestions of @Fanchen Bao – Marko Šterk Feb 18 '23 at 14:09

1 Answer


This is not strictly a solution, but a suggestion on how the coupling function could be optimized.

Caveat

Without source data, the pseudocode below is untested. The implementation may or may not function correctly, but the core idea should be sound.

Core Idea

Look at this code snippet in particular

for neighbour, weight in cells_neighbours[i]:
    voltage_current += (cell.Cgjv*weight)*(cells[neighbour].V-cell.V)

If we write the neighboring cells' weights as w1, w2, ..., wm and their voltages as V1, V2, ..., Vm, then

voltage_current
= cell.Cgjv * w1 * (V1 - cell.V) + cell.Cgjv * w2 * (V2 - cell.V) + ... + cell.Cgjv * wm * (Vm - cell.V)
= cell.Cgjv * (w1 * V1 + w2 * V2 + ... + wm * Vm - cell.V * (w1 + w2 + ... + wm))
= cell.Cgjv * (np.dot(W, V) - cell.V * np.sum(W))

Here W and V refer to the vectors [w1, w2, ..., wm] and [V1, V2, ..., Vm] (represented in Python as numpy arrays). We can leverage numpy's vectorization to speed up the inner loop.
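As a quick sanity check of this identity (toy numbers; the names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W = rng.random(5)   # weights of one cell's five neighbours
V = rng.random(5)   # V values of those neighbours
cell_V, Cgjv = -60.0, 0.12

loop_result = sum(Cgjv * w * (v - cell_V) for w, v in zip(W, V))
vec_result = Cgjv * (np.dot(W, V) - cell_V * np.sum(W))
assert np.isclose(loop_result, vec_result)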

The same logic applies to the other three computations. This optimization requires that each of V, C, G6P and ADPm be stored as its own numpy array outside the cell object. We are basically stripping down OOP in favor of speed. It might make the code base a bit harder to maintain, but it will surely give a performance boost.

Optimized Pseudocode

from typing import Dict, List

import numpy as np


def calculate_gj_coupling(
    cells: List[Cell],
    cells_neighbours_indices: List[np.ndarray],
    cells_neighbours_weights: List[np.ndarray],
    cells_attrb: Dict[str, np.ndarray],
) -> None:
    """Calculate GJ coupling.

    Notice that we don't have to return anything, because the update to cells
    are done in-place

    :param cells: a list of cells.
    :type cells: List[Cell]
    :param cells_neighbours_indices: index of the outer list is the index to
        each cell in cells. Value of the inner list is the index of the
        neighboring cell.
        e.g. given cells_neighbours_indices = [
            [1, 2, 3],
            [0, 3],
            [0, 3],
            [0, 1, 2],
        ]
        We say
        cells[0] is neighbouring with cells[1], cells[2], and cells[3].
        cells[1] is neighbouring cells[0], cells[3]
        cells[2] is neighbouring cells[0], cells[3]
        cells[3] is neighbouring cells[0], cells[1], cells[2]
    :type cells_neighbours_indices: List[np.ndarray]
    :param cells_neighbours_weights: index of the outer list is the index to
        each cell in cells. Value of the inner list is the weight of the
        neighboring cell.
        e.g. given cells_neighbours_weights = [
            [1, 2, 3],
            [4, 5],
            [6, 7],
            [8, 9, 10],
        ]
        We say
        cells[0]'s three neighbors have weights [1, 2, 3]
        cells[1]'s two neighbors have weights [4, 5]
        cells[2]'s two neighbors have weights [6, 7]
        cells[3]'s three neighbors have weights [8, 9, 10]
    :type cells_neighbours_weights: List[np.ndarray]
    :param cells_attrb: a central place to store the attributes of all cells.
        It has the following shape
        {
            'V': np.array([v0, v1, ..., vn]),  # all cells' V
            'C': np.array([c0, c1, ..., cn]),  # all cells' C
            'G6P': np.array([g0, g1, ..., gn]),  # all cells' G6P
            'ADPm': np.array([a0, a1, ..., an]),  # all cells' ADPm
        }
    """
    for i, cell in enumerate(cells):
        idx = cells_neighbours_indices[i]
        ws = cells_neighbours_weights[i]
        sum_w = np.sum(ws)
        cell.couplingV = cell.Cgjv * (np.dot(ws, cells_attrb['V'][idx]) - cell.V * sum_w)
        cell.couplingCa = cell.Cgjca * (np.dot(ws, cells_attrb['C'][idx]) - cell.C * sum_w)
        cell.couplingG6P = cell.Cgjg6p * 0.3 * (np.dot(ws, cells_attrb['G6P'][idx]) - cell.G6P * sum_w)
        cell.couplingADP = cell.Cgjadp * (np.dot(ws, cells_attrb['ADPm'][idx]) - cell.ADPm * sum_w)
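A hedged usage sketch for this version (names are illustrative; `time_arr` stands in for the question's pre-calculated time array, and the attribute vectors must be refreshed from the cells after every integration step):

# Example wiring for the four cells from the docstring above.
cells_neighbours_indices = [np.array([1, 2, 3]), np.array([0, 3]),
                            np.array([0, 3]), np.array([0, 1, 2])]
cells_neighbours_weights = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0]),
                            np.array([6.0, 7.0]), np.array([8.0, 9.0, 10.0])]

for step, current_time in enumerate(time_arr):
    glc = ModelParameters.external_glucose(current_time)
    cells_attrb = {  # refresh the attribute vectors from the cells
        'V': np.array([c.V for c in cells]),
        'C': np.array([c.C for c in cells]),
        'G6P': np.array([c.G6P for c in cells]),
        'ADPm': np.array([c.ADPm for c in cells]),
    }
    calculate_gj_coupling(cells, cells_neighbours_indices,
                          cells_neighbours_weights, cells_attrb)
    for c in cells:
        c.run_model_step(glc)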

Can We Do Even Better?

Notice that the optimized pseudocode above only deals with the inner loop. Not much is going on in the outer loop either, so can we vectorize it as well? Let's take cell.couplingV as an example. What we want calculate_gj_coupling to accomplish is the following:

couplingV0 = Cgjv0 * (np.dot(nei_W0, nei_V0) - V0 * np.sum(nei_W0))
couplingV1 = Cgjv1 * (np.dot(nei_W1, nei_V1) - V1 * np.sum(nei_W1))
.
.
.
couplingVn = Cgjvn * (np.dot(nei_Wn, nei_Vn) - Vn * np.sum(nei_Wn))

where couplingV0, couplingV1, ..., couplingVn are the couplingV values of each cell, Cgjv0, Cgjv1, ..., Cgjvn are the Cgjv values of each cell, and V0, V1, ..., Vn are V values of each cell.

nei_W0, nei_W1, ..., nei_Wn are a list of vectors, each vector being a list of weights of a cell's neighboring cells. Similarly, nei_V0, nei_V1, ..., nei_Vn are a list of vectors, each vector being a list of V values of a cell's neighboring cells.

We can rewrite that as

couplingV = Cgjv * ([nei_W0 . nei_V0, nei_W1 . nei_V1, ..., nei_Wn . nei_Vn] - V * [sum(nei_W0), sum(nei_W1), ..., sum(nei_Wn)])

where . is dot product of vectors and * is element-wise multiplication.

This equation tells us that if we use a vector to represent all cells' couplingV, we are able to vectorize the entire process of calculate_gj_coupling.

One more thing that needs addressing is the handling of nei_W and nei_V. We can turn them into matrices W_mat and V_mat. For example, W_mat[i][j] represents the weight of cells[j] when it neighbors cells[i]; if they do not neighbor each other, set W_mat[i][j] to zero.

Below is another piece of pseudocode implementing the full-vectorization idea. Note that we strip down OOP even more. Also, W_mat, V_mat, C_mat, G6P_mat, and ADPm_mat each contain repeated data and are sparse; we are essentially trading space for time.

from typing import Dict

import numpy as np


def calculate_gj_coupling(
    W_mat: np.ndarray,
    V_mat: np.ndarray,
    C_mat: np.ndarray,
    G6P_mat: np.ndarray,
    ADPm_mat: np.ndarray,
    cells_attrb: Dict[str, np.ndarray],
) -> Dict[str, np.ndarray]:
    """_summary_

    :param W_mat: W_mat is a matrix of ALL weights of cells neighboring all
        other cells. e.g. W_mat[i][j] is the weight of cells[j] when it is
        neighboring cells[i]. If cells[i] and cells[j] do not neighbor,
        set W_mat[i][j] = 0. It has shape (n, n), where n is the total number
        of cells.
    :type W_mat: np.ndarray
    :param V_mat: V_mat is a matrix of ALL Vs of cells neighboring all other
        cells. e.g. V_mat[i][j] is the V of cells[j] when it is neighboring
        cells[i]. If cells[i] and cells[j] do not neighbor, set V_mat[i][j] = 0
        It has shape (n, n), where n is the total number of cells.
    :type V_mat: np.ndarray
    :param C_mat: C_mat is a matrix of ALL Cs of cells neighboring all other
        cells. e.g. C_mat[i][j] is the C of cells[j] when it is neighboring
        cells[i]. If cells[i] and cells[j] do not neighbor, set C_mat[i][j] = 0
        It has shape (n, n), where n is the total number of cells.
    :type C_mat: np.ndarray
    :param G6P_mat: G6P_mat is a matrix of ALL G6Ps of cells neighboring all
        other cells. e.g. G6P_mat[i][j] is the G6P of cells[j] when it is
        neighboring cells[i]. If cells[i] and cells[j] do not neighbor, set
        G6P_mat[i][j] = 0. It has shape (n, n), where n is the total number of
        cells.
    :type G6P_mat: np.ndarray
    :param ADPm_mat: ADPm_mat is a matrix of ALL ADPms of cells neighboring all
        other cells. e.g. ADPm_mat[i][j] is the ADPm of cells[j] when it is
        neighboring cells[i]. If cells[i] and cells[j] do not neighbor, set
        ADPm_mat[i][j] = 0. It has shape (n, n), where n is the total number of
        cells.
    :type ADPm_mat: np.ndarray
    :param cells_attrb: a central place to store the attributes of all cells.
        It has the following shape
        {
            'V': np.array([v0, v1, ..., vn]),  # all cells' V
            'C': np.array([c0, c1, ..., cn]),  # all cells' C
            'G6P': np.array([g0, g1, ..., gn]),  # all cells' G6P
            'ADPm': np.array([a0, a1, ..., an]),  # all cells' ADPm
            'Cgjv': np.array([cv0, cv1, ..., cvn]),  # all cells' Cgjv
            'Cgjca': np.array([cca0, cca1, ..., ccan]),  # all cells' Cgjca
            'Cgjg6p': np.array([cg6p0, cg6p1, ..., cg6pn]),  # all cells' Cgjg6p
            'Cgjadp': np.array([cadp0, cadp1, ..., cadpn]),  # all cells' Cgjadp
        }
    :return: a dictionary of the following shape
        {
            'couplingV': np.ndarray,
            'couplingCa': np.ndarray,
            'couplingG6P': np.ndarray,
            'couplingADP': np.ndarray,
        }
        Each array has length n, recording the coupling values of each cell.
    :rtype: Dict[str, np.ndarray]
    """
    sum_w = np.sum(W_mat, axis=1)  # a vector of weight sums for each cell
    # `*` is element-wise multiplication. Summing the element-wise product
    # W_mat * X_mat over each row gives, for cell i, sum_j(w_ij * X_j),
    # which is exactly the dot product the inner loop used to compute.
    return {
        'couplingV': cells_attrb['Cgjv'] * (np.sum(W_mat * V_mat, axis=1) - cells_attrb['V'] * sum_w),
        'couplingCa': cells_attrb['Cgjca'] * (np.sum(W_mat * C_mat, axis=1) - cells_attrb['C'] * sum_w),
        'couplingG6P': cells_attrb['Cgjg6p'] * 0.3 * (np.sum(W_mat * G6P_mat, axis=1) - cells_attrb['G6P'] * sum_w),
        'couplingADP': cells_attrb['Cgjadp'] * (np.sum(W_mat * ADPm_mat, axis=1) - cells_attrb['ADPm'] * sum_w),
    }
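For completeness, a hedged sketch of how these matrices could be assembled (illustrative names, reusing the neighbour lists from the first version). If the coupling graph is fixed, W_mat only needs to be built once; the attribute matrices are then cheap row-wise broadcasts of the vectors in cells_attrb:

import numpy as np

n = len(cells_attrb['V'])
W_mat = np.zeros((n, n))
for i, (idx, ws) in enumerate(zip(cells_neighbours_indices, cells_neighbours_weights)):
    W_mat[i, idx] = ws  # zero wherever two cells are not coupled

# Since W_mat is zero for non-neighbours, each attribute matrix can simply
# repeat the attribute vector along its rows: X_mat[i][j] = X_j.
V_mat = np.tile(cells_attrb['V'], (n, 1))
C_mat = np.tile(cells_attrb['C'], (n, 1))
G6P_mat = np.tile(cells_attrb['G6P'], (n, 1))
ADPm_mat = np.tile(cells_attrb['ADPm'], (n, 1))

coupling = calculate_gj_coupling(W_mat, V_mat, C_mat, G6P_mat, ADPm_mat, cells_attrb)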
Fanchen Bao
  • I did not expect such an answer, to be honest. I can't say for sure yet, but it appears to be spot on. However, there is the small issue of efficiently updating the matrices (V_mat, C_mat, etc.) in every iteration, or am I missing something? The values would have to be set in specific fields in each time step. – Marko Šterk Feb 17 '23 at 17:47
  • If the cell graph changes from iteration to iteration (i.e. in this iteration cells[0] neighbors cells[1], but in the next iteration cells[0] neighbors cells[2] instead), the matrices need to be updated. If the cells' attributes, e.g. V, C, G6P, etc., change from iteration to iteration, the matrices and the vectors in `cells_attrb` need to change. But the cost of those changes should be the same as the cost of change in your current setup. The optimization I proposed would indeed require you to re-design the majority of the program. – Fanchen Bao Feb 17 '23 at 18:35
  • Updating a matrix or vector with `numpy`'s methods is significantly faster than a for loop. In fact, I'd suggest avoiding for loops as much as possible, as they are one of the biggest performance killers. – Fanchen Bao Feb 17 '23 at 18:42
  • The values of V, C, G6P and ADP change for each cell in each simulation step; that is sort of the point of the simulation. So the corresponding matrices would also have to change in each step. But how can I achieve that in an efficient manner with only numpy methods? I don't know of any method that could do that. I know that for loops are a major performance killer, but sometimes they are easier to use :) – Marko Šterk Feb 18 '23 at 08:39
  • Hmm... all the matrices need to be updated. That's not good. In that case, try the first optimization. It at least vectorizes one set of for loops, and updating the vectors in `cells_attrb` is much easier than updating matrices. Maybe the first optimization will be fast enough for you. – Fanchen Bao Feb 18 '23 at 09:50