
I'm doing some file parsing that is a CPU-bound task. No matter how many files I throw at the process, it uses no more than about 50MB of RAM. The task is parallelisable, and I've set it up to use concurrent.futures below to parse each file as a separate process:

    from concurrent import futures
    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary mapping each future (key) to its filename (value)
        jobs = {}

        # Loop through the files, and run the parse function for each file, sending the file-name to it.
        # The results can come back in any order.
        for this_file in files_list:
            job = executor.submit(parse_function, this_file, **parser_variables)
            jobs[job] = this_file

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):

            # Get the result (job.result()) and the file it came from (jobs[job])
            results_list = job.result()
            this_file = jobs[job]

            # Delete the entry from the dict as we don't need to store it.
            del jobs[job]

            # post-processing (putting the results into a database)
            post_process(this_file, results_list)

The problem is that when I run this using futures, RAM usage rockets and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_process, the application has no further need of them. As you can see, I'm trying `del jobs[job]` to clear items out of jobs, but this has made no difference; memory usage continues to increase at the same rate.

I've also confirmed it's not because it's waiting on the post_process function by only using a single process, plus throwing in a time.sleep(1).

There's nothing in the futures docs about memory management, and while a brief search indicates it has come up before in real-world applications of futures (Clear memory in python loop and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures), the answers don't translate to my use-case (they're all concerned with timeouts and the like).

So, how do you use Concurrent futures without running out of RAM? (Python 3.5)


4 Answers


I'll take a shot (might be a wrong guess...)

You might need to submit your work bit by bit, since each submit makes a copy of parser_variables, which may end up chewing through your RAM.

Here is working code with "<----" marking the interesting parts:

    MAX_JOBS_IN_QUEUE = 12  # needs defining somewhere; e.g. 2 * max_workers is a reasonable cap

    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary mapping each future (key) to its filename (value)
        jobs = {}

        # Loop through the files, and run the parse function for each file, sending the file-name to it.
        # The results can come back in any order.
        files_left = len(files_list) #<----
        files_iter = iter(files_list) #<------

        while files_left:
            for this_file in files_iter:
                job = executor.submit(parse_function, this_file, **parser_variables)
                jobs[job] = this_file
                if len(jobs) > MAX_JOBS_IN_QUEUE:
                    break  # limit the job submissions for now

            # Get the completed jobs whenever they are done
            for job in futures.as_completed(jobs):

                files_left -= 1  # one down - many to go...   <---

                # Get the result (job.result()) and the file it came from (jobs[job])
                results_list = job.result()
                this_file = jobs[job]

                # Delete the entry from the dict as we don't need to store it.
                del jobs[job]

                # Post-processing (putting the results into a database)
                post_process(this_file, results_list)
                break  # give a chance to add more jobs <-----
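
As an aside on that input-copying point: on Python 3.7+ (newer than the question's 3.5), `ProcessPoolExecutor` accepts `initializer`/`initargs`, which lets you ship `parser_variables` to each worker once instead of re-pickling it on every submit. A rough sketch; `_init_worker` and `_parse_one` are made-up helper names:

    from concurrent import futures

    _parser_variables = None

    def _init_worker(parser_variables):
        # Runs once in each worker process; stash the shared variables there
        global _parser_variables
        _parser_variables = parser_variables

    def _parse_one(this_file):
        # Wrapper so each submit() only has to pickle the filename
        return parse_function(this_file, **_parser_variables)

    with futures.ProcessPoolExecutor(max_workers=6,
                                     initializer=_init_worker,
                                     initargs=(parser_variables,)) as executor:
        job = executor.submit(_parse_one, this_file)  # parser_variables no longer sent per job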
  • Excellent answer, thank you. That resolved it nicely, with peak RAM usage spiking at about 140MB; I never considered the inputs as being the problem (you're right, they're also very large). (That was after spending 20 mins wondering why yours wasn't really multi-processing: you'd over-indented the `for job in...` line so it was a child of the `for this_file in...` loop (corrected now).) *Note to the Python designers: Invisible characters for critical syntax is not a good idea!* – GIS-Jonathan Jan 13 '16 at 17:05
  • @GIS-Jonathan - In addition, [`futures.as_completed()`](https://github.com/python/cpython/blob/3.7/Lib/concurrent/futures/_base.py#L196), internally, makes a copy of the futures it is acting on. If `parse_function` could accept and return the filename, `jobs` could be deleted immediately after the call to `as_completed`, and garbage collection could dispense with it as soon as `as_completed` and its helpers have *de-referenced* it. That's the way it looks to me; not sure there is any actual improvement except maybe keeping the future and its (file)name together through the whole process. – wwii Jul 07 '18 at 16:03
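
Putting wwii's suggestion into code: a minimal sketch, assuming `parse_function` is changed to return its filename along with its results (that change is hypothetical, not in the question):

    from concurrent import futures

    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        jobs = [executor.submit(parse_function, f, **parser_variables)
                for f in files_list]
        job_iter = futures.as_completed(jobs)  # as_completed keeps its own internal copy
        del jobs  # drop our references; recent Pythons also de-reference each future after yielding it

        for job in job_iter:
            # parse_function now (hypothetically) returns (filename, results)
            this_file, results_list = job.result()
            post_process(this_file, results_list)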

Try adding del to your code like this:

    for job in futures.as_completed(jobs):
        del jobs[job]  # or `val = jobs.pop(job)`
        # del job  # or `job._result = None`
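
(In CPython's implementation, `_result` is the attribute where a `Future` stores its result, so `job._result = None` releases the result's memory even if something else still holds a reference to the future, at the cost of touching a private attribute.)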
  • This worked for me; memory usage is once again stable. Looks like dereferencing each future upon completion is the key to memory management when using futures. I additionally do a `gc.collect()` afterwards to make sure. – BB1 Jan 19 '21 at 18:01

Looking at the concurrent.futures.as_completed() function, I learned that it is enough to ensure there is no longer any reference to the future. If you dispense with this reference as soon as you've got the result, you'll minimise memory usage.

I use a generator expression for storing my Future instances, because everything I care about is already returned by the future in its result (basically, the status of the dispatched work). Other implementations use a dict, as in your case, because the input filename isn't returned as part of the thread worker's result.

Using a generator expression means once the result is yielded, there is no longer any reference to the Future. Internally, as_completed() has already taken care of removing its own reference, after it yielded the completed Future to you.

    futures = (executor.submit(thread_worker, work) for work in workload)

    for future in concurrent.futures.as_completed(futures):
        output = future.result()
        ...  # on next loop iteration, garbage will be collected for the result data, too
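
One caveat: `as_completed()` builds a set from its argument internally, so the generator is consumed (and every job is submitted) up front. The generator expression avoids holding references to completed futures and their results, but it does not throttle submission the way the MAX_JOBS_IN_QUEUE approach in the first answer does.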

Edit: Simplified from using a set and removing entries, to simply using a generator expression.

  • A simpler solution would be to use a **generator** instead of a set. Then there is no need to remove anything. In other words, `futures = (executor.submit(thread_worker, work) for work in workload)` – Arel Jan 02 '22 at 21:33
  • This did it for me - around 2.4M work items queued up with "stable/fixed" memory pressure while it's computing. Using the `ThreadPoolExecutor` – nover Feb 07 '22 at 09:59

Same problem for me.

In my case I needed to start millions of threads. For Python 2, I would write a thread pool myself using a dict. But in Python 3 I encountered the following error when I deleted finished threads from the dict dynamically:

RuntimeError: dictionary changed size during iteration
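
That error is easy to reproduce: in Python 3, removing entries from a dict while you iterate over it raises exactly this. A minimal, self-contained illustration:

    import threading
    import time

    def worker():
        time.sleep(0.01)

    threads = {i: threading.Thread(target=worker) for i in range(10)}
    for t in threads.values():
        t.start()
    time.sleep(0.1)  # let them all finish

    for tid in threads:
        if not threads[tid].is_alive():
            del threads[tid]  # RuntimeError: dictionary changed size during iteration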

So I had to use concurrent.futures; at first I coded it like this:

    from concurrent.futures import ThreadPoolExecutor
    ......
    if __name__ == '__main__':
        all_resources = get_all_resources()
        with ThreadPoolExecutor(max_workers=50) as pool:
            for r in all_resources:
                pool.submit(handle_resource, *args)

But memory was soon exhausted, because it is only released after all the threads have finished. I needed to delete finished threads before too many new ones started. So I read the docs here: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures

I found that Executor.shutdown(wait=True) might be what I needed. This was my solution:

    from concurrent.futures import ThreadPoolExecutor
    ......
    if __name__ == '__main__':
        all_resources = get_all_resources()
        i = 0
        while i < len(all_resources):
            with ThreadPoolExecutor(max_workers=50) as pool:
                for r in all_resources[i:i+1000]:
                    pool.submit(handle_resource, *args)
                i += 1000

From the docs: "You can avoid having to call this method explicitly if you use the with statement, which will shutdown the Executor (waiting as if Executor.shutdown() were called with wait set to True)."

Update:

A better solution I just found:

    from concurrent.futures import ThreadPoolExecutor, Future, as_completed
    from typing import Set

    futures: Set[Future] = set()
    with ThreadPoolExecutor(max_workers) as thread_pool:
        for resource in resources:  # any list/set/iterator/generator
            if len(futures) >= 1000:
                # Release a completed future once more than 1000 have been created,
                # then submit (create) a new one. This prevents memory exhaustion
                # when millions of futures are needed.
                completed_future = next(as_completed(futures))
                futures.remove(completed_future)
            future = thread_pool.submit(resource_handler, args)
            futures.add(future)
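
Note that `next(as_completed(futures))` blocks until at least one pending future has finished, so this pattern keeps at most roughly 1000 futures (and their results) alive at any moment.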
  • Despite using `ProcessPoolExecutor`, the point that _memory will be released only after all threads finished_ is actually the key. I have 40K+ tasks to do and each takes about 2 MB, which...exploded my RAM – RaenonX Sep 21 '21 at 07:49