
TL;DR: ThreadPoolExecutor was the reason. See: Memory usage with concurrent.futures.ThreadPoolExecutor in Python3

Here's a Python script (simplified a lot) that runs an all-to-all routing algorithm, and in the process it eats up all available memory.

I understand the problem is that the main function does not return, so the objects created inside it aren't cleaned up by the garbage collector.

My main question: is it possible to write a consumer for the generator that returns, so that the data gets cleaned up? Or should I just call the garbage collector manually?

from concurrent.futures import ThreadPoolExecutor, as_completed

# thread pool executor like in python documentation example
def table_process(callable, total, threads=10):
    with ThreadPoolExecutor(max_workers=threads) as e:
        future_map = {
            e.submit(callable, i): i
            for i in range(total)
        }

        for future in as_completed(future_map):
            if future.exception() is None:
                yield future.result()
            else:
                raise future.exception()

@argh.dispatch_command
def main():
    threads = 10
    data = pd.DataFrame(...)  # about 12K rows

    # this function routes only one slice of sources/destinations
    def _process_chunk(x:int) -> gpd.GeoDataFrame:
        # slicing is more complex, but simplified here for presentation
        # do cross-product and an http request to process the result
        result_df = _do_process(grid[x], grid)
        return result_df

    # writing to geopackage
    with fiona.open('/tmp/some_file.gpkg', 'w', driver='GPKG', schema=...) as f:
        for results_df in table_process(_process_chunk, len(data)):
            aggregated_df = results_df.groupby('...').aggregate({...})
            f.writerecords(aggregated_df)
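On the first question: a generator's frame, and everything it references, is freed as soon as the consumer drops its last reference to the generator, which happens naturally when a consuming function returns. A minimal sketch of that behavior (the `gen` and `consume_some` names are illustrative, not from the original script):

```python
def gen():
    big = list(range(100_000))  # held by the generator frame while suspended
    for x in big:
        yield x

def consume_some(n):
    g = gen()
    values = [next(g) for _ in range(n)]
    return values
    # g's last reference dies when this function returns; CPython then
    # closes the generator and frees its frame, including `big`

print(consume_some(3))  # [0, 1, 2]
```

So wrapping the consumption loop in a function that returns does release the generator's state — but, as it turned out, that was not where the memory went in this case.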
culebrón
    "I understand the problem is that the main function does not return, and the objects created inside it aren't cleaned up by garbage collector." In CPython, objects are reclaimed *immediately* when their reference count drops to zero, not when a function returns. The `gc` module only deals with cyclic garbage, so collecting won't necessarily affect your situation — but you may have reference cycles causing the leak, so you might as well try it. Without a reproducible example, though, one can only speculate. – juanpa.arrivillaga Dec 27 '18 at 21:32
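The comment's distinction between reference counting and the cyclic collector can be observed directly. A small demonstration (the `Node` class is made up for the example):

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# In CPython, an acyclic object is reclaimed the moment its
# reference count hits zero -- no gc.collect() required.
n = Node()
del n  # freed immediately

# Objects in a reference cycle never reach refcount zero on their
# own; only the cyclic collector can reclaim them.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b
collected = gc.collect()  # returns the number of unreachable objects found
print(collected >= 2)  # True: the two Nodes were cyclic garbage
```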

1 Answer


It turned out to be ThreadPoolExecutor, which keeps its workers alive and does not release their memory.

Solutions are here: Memory usage with concurrent.futures.ThreadPoolExecutor in Python3
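One of the approaches suggested there is to submit work in bounded batches instead of creating all the futures up front, so each batch's futures (and the results they hold) can be released before the next batch starts. A sketch under that assumption (`process_in_batches` is a hypothetical name, not from the linked answers):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice

def process_in_batches(fn, total, max_workers=10, batch_size=50):
    """Yield results batch by batch so that completed futures
    (and the results they reference) can be garbage-collected."""
    indices = iter(range(total))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            batch = list(islice(indices, batch_size))
            if not batch:
                break
            futures = [pool.submit(fn, i) for i in batch]
            for future in as_completed(futures):
                yield future.result()
            # `futures` is rebound on the next iteration, dropping
            # the previous batch's future objects and their results

results = sorted(process_in_batches(lambda i: i * i, total=7, batch_size=3))
print(results)  # [0, 1, 4, 9, 16, 25, 36]
```

Tuning `batch_size` trades scheduling overhead against peak memory: at most one batch of pending results is held at a time.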

culebrón