
Python concurrent.futures.ProcessPoolExecutor crashing with full RAM

Program description

Hi, I've got a computationally heavy function which I want to run in parallel. The function is a test that accepts as inputs:

  • a DataFrame to test on
  • parameters based on which the calculations will be run.

The return value is a short list of calculation results.

I want to run the same function in a for loop with different parameters and the same input DataFrame, essentially a brute-force search for the optimal parameters for my problem.

The code I've written

I am currently running the code concurrently with ProcessPoolExecutor from the concurrent.futures module.

import concurrent.futures
from itertools import repeat
import pandas as pd

from my_tests import func


parameters = [
    (arg1, arg2, arg3),
    (arg1, arg2, arg3),
    ...
]
large_df = pd.read_csv(csv_path)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for future in executor.map(func, repeat(large_df.copy()), parameters):
        test_result = future.result()
        ...

The problem

The problem I face is that I need to run a large number of iterations, but my program crashes almost instantly.

In order for it not to crash, I need to limit it to at most 4 workers, which is 1/4 of my CPU resources.

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    ...

I figured out that my program crashes because it fills up my RAM (16 GB). What I found weird is that when I ran it with more workers, it gradually ate more and more RAM, which it never released, until it crashed.

Instead of passing a copy of the DataFrame, I tried to pass the file path, but apart from slowing down my program, it didn't change anything.

Do you have any idea of why that problem occurs and how to solve it?

janboro
  • so you have identified that `my_tests.func` has a memory leak... we would need to know more about `func` in order to help. Aside from leaks, if the return data (`future.result()`) is significant in size, you'll need to make sure you're processing it and releasing it in the main process as well. – Aaron Nov 14 '22 at 15:28
  • If the leak is in a 3rd party library you must use inside of `func`, set the `max_tasks_per_child` parameter of your executor in order to periodically restart the worker processes. This adds overhead, but can force libraries to reload / clear memory – Aaron Nov 14 '22 at 15:32
  • If the problem is the return data (`executor.map` will wait for all results to be done before starting your loop), you should instead `submit` all your tasks, then call `concurrent.futures.as_completed` on all the `future` objects you collected from `submit`. This will allow the main process to handle the results as they are completed rather than waiting for them all to finish (which requires having enough memory to store all the results at once) – Aaron Nov 14 '22 at 15:40
  • The `func` function is running multiple calculations using numpy and pandas to calculate some values based on the initial dataframe. As to the `submit` and `as_completed` approach, it was my initial code, however the problem was the same. I will look into your suggestions and keep you updated. – janboro Nov 14 '22 at 16:11
  • The `map` method returns an iterator that when iterated directly returns the next result (i.e. the return value from `func`) and not a `Future` on which you must then call the `result` method. – Booboo Nov 15 '22 at 17:07
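
For reference, a rough sketch of the submit / as_completed pattern and the max_tasks_per_child option mentioned in the comments above (max_tasks_per_child requires Python 3.11+; csv_path, parameters and func are the placeholders from the question). Note that this still pickles the DataFrame once per task; see the answer below for how to avoid that.

import concurrent.futures
import pandas as pd

from my_tests import func

parameters = [
    (arg1, arg2, arg3),
    ...
]
large_df = pd.read_csv(csv_path)

# max_tasks_per_child restarts each worker process after 10 tasks,
# forcing a leaky library to release whatever memory it is holding on to
with concurrent.futures.ProcessPoolExecutor(max_tasks_per_child=10) as executor:
    futures = [executor.submit(func, large_df, p) for p in parameters]
    for future in concurrent.futures.as_completed(futures):
        test_result = future.result()  # handle each result as soon as it is ready
        ...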

2 Answers


See my comment on what map actually returns.

How relevant this answer is depends on how large your parameters list is, i.e. how many total tasks are being placed on the multiprocessing pool's task queue:

You are currently creating and passing a copy of your dataframe (with large_df.copy()) every time you submit a new task (one task for each element of parameters). One thing you can do instead is initialize each pool process once, with a single copy per pool process that is then reused by every task that process executes. The assumption is that the dataframe itself is not modified by my_tests.func. If it is modified and you need a fresh copy of the original large_df for each task, the function worker (see below) can make that copy; in that case up to 2 * N copies (instead of just N copies) exist simultaneously, where N is the number of processes in the pool. Either way this saves memory whenever the length of parameters is greater than N, since in your current code a copy of the dataframe exists for every task, either sitting on the task queue or in a pool process's address space.

If you are running under a platform such as Linux that uses the fork method to create new processes, then each child process will inherit a copy automatically as a global variable:

import concurrent.futures
import pandas as pd

from my_tests import func


parameters = [
    (arg1, arg2, arg3),
    (arg1, arg2, arg3),
    ...
]

large_df = pd.read_csv(csv_path) # will be inherited

def worker(parameter):
    return func(large_df, parameter)
    # or, if func modifies the dataframe:
    # return func(large_df.copy(), parameter)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for result in executor.map(worker, parameters):
        ...

my_tests.func expects a dataframe as its first argument, but with the above change the dataframe is no longer being passed as an argument; it is now accessed as a global variable. So without modifying func, we need an adapter function, worker, that passes to func what it is expecting. Of course, if you are able to modify func, then you can do without the adapter.

If you were running on a platform such as Windows that uses the spawn method to create new processes, then:

import concurrent.futures
import pandas as pd

from my_tests import func

def init_pool_processes(df):
    global large_df
    large_df = df


def worker(parameter):
    return func(large_df, parameter)
    # or, if func modifies the dataframe:
    # return func(large_df.copy(), parameter)

if __name__ == '__main__':
    
    parameters = [
        (arg1, arg2, arg3),
        (arg1, arg2, arg3),
        ...
    ]
    
    large_df = pd.read_csv(csv_path) # will be passed to each pool process by the initializer
    
    with concurrent.futures.ProcessPoolExecutor(initializer=init_pool_processes, initargs=(large_df,)) as executor:
        for result in executor.map(worker, parameters):
            ...
Booboo

Combining the suggestions of Aaron and Booboo: the cause of my problem was indeed copying the large DataFrame every single time I called the function, which made my computer run out of memory. The quick fix I found was to delete the copy of the DataFrame at the end of func:

def func(large_df_copy, parameters):
    ...
    del large_df_copy  # drop the reference so the copy can be garbage-collected right away

I will look into modifying Booboo's code, as the DataFrame is indeed modified in the function in order to run calculations.
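
As a rough sketch of what that adaptation might look like (reusing init_pool_processes and the executor setup from Booboo's answer, and assuming the same func), the copy can be made inside the worker, so that at most one extra copy per pool process is alive at a time instead of one per task:

from my_tests import func

def init_pool_processes(df):
    global large_df
    large_df = df

def worker(parameter):
    # large_df is the per-process DataFrame set up by the initializer (or inherited via fork);
    # copy it here because func modifies the DataFrame it receives
    return func(large_df.copy(), parameter)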

Thank you very much for helping me out, I appreciate it a lot!

janboro