
I am working on a Kedro 0.17.2 project that is running into out-of-memory issues, and I'm trying to reduce its memory footprint.

I'm profiling with mprof from the memory-profiler library, and I noticed that there is always a child process and that memory seems to duplicate in the main process after the first computation in the running node. Is it possible that Kedro is duplicating the dataframes in memory? And, if so, is there a way to avoid this?

Notes:

  • I'm using the SequentialRunner
  • I'm not using the is_async cli option
  • I'm not using either multithreading or multiprocessing in the node execution


lspinheiro

2 Answers


Hi @lspinheiro, it's a little difficult to ascertain what's going on. In short, we do not expect Kedro to duplicate memory out of the box; in theory, this could be introduced by something in hooks.py.

Either way, I can help you reduce your memory footprint:

  1. Persist data more often, reduce your use of implicit MemoryDataSets.
  2. Understand the particular logic in your node: what are you doing in pandas? Is there a vectorized way of doing what you're trying to do?
  3. Use CachedDataSet if you use the same datasets over and over.
  4. Break up your pipelines into smaller ones and run each part individually. Mostly to narrow down the problem area(s).
datajoely
  • I recently noticed this behaviour is introduced by the kedro `mem_profile` decorator that is being used in a hook, like you mentioned, but I haven't been able to precisely determine why this happens. – lspinheiro Oct 19 '21 at 10:14

It turns out this issue is caused by a possible bug in the memory-profiler library that is used in the kedro.extras.decorators.memory_profiler.mem_profile decorator.

The Kedro decorator uses the memory_usage function from the memory-profiler module to sample the total memory used by the running function from within the Python process.

There is an open issue about this problem, but no solution yet: https://github.com/pythonprofilers/memory_profiler/issues/332

For the moment I have just removed the decorator.
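If a rough per-node memory figure is still useful once the decorator is gone, one possible stand-in is the standard library's tracemalloc, which records allocations inside the running process rather than sampling it from outside. The sketch below is only an illustration under assumptions: `peak_memory` and `last_peak_bytes` are hypothetical names, not part of Kedro or memory-profiler, and tracemalloc only sees memory allocated through Python's allocator, so the number will not match what mprof reports.

```python
import functools
import logging
import tracemalloc


def peak_memory(func):
    """Log the peak traced memory allocated while ``func`` runs.

    Hypothetical helper for illustration; not a Kedro or memory-profiler API.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log = logging.getLogger(__name__)
        # Start tracing only if it is not already active, and remember
        # that so we stop only what we started.
        was_tracing = tracemalloc.is_tracing()
        if not was_tracing:
            tracemalloc.start()
        before, _ = tracemalloc.get_traced_memory()
        try:
            return func(*args, **kwargs)
        finally:
            _, peak = tracemalloc.get_traced_memory()
            if not was_tracing:
                tracemalloc.stop()
            # If tracing was already on, the recorded peak may predate this
            # call, so treat the figure as an approximation.
            wrapper.last_peak_bytes = max(peak - before, 0)
            log.info(
                "%s: peak traced memory %.2f MiB",
                func.__qualname__,
                wrapper.last_peak_bytes / 2**20,
            )

    wrapper.last_peak_bytes = None
    return wrapper


@peak_memory
def build_list(n):
    # Stand-in for a node function that allocates a large object.
    data = [0] * n
    return len(data)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    build_list(1_000_000)
```

Tracing adds overhead to every allocation, so this is probably better suited to a debugging run than to being left on permanently.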

lspinheiro