
I have a program which repeatedly loops over a pandas DataFrame, as shown below:

months = [some months]

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)

However, the memory required per iteration keeps increasing:

                                           types |   # objects |   total size
================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          22 |      6.54 GB
               <class 'pandas.core.series.Series |        1198 |      4.72 GB
                           <class 'numpy.ndarray |        1707 |    648.19 MB
     <class 'pandas.core.categorical.Categorical |         238 |    368.90 MB
          <class 'pandas.core.indexes.base.Index |         256 |    312.03 MB

================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          30 |      9.04 GB
               <class 'pandas.core.series.Series |        2262 |      7.29 GB
                           <class 'numpy.ndarray |        2958 |    834.49 MB
     <class 'pandas.core.categorical.Categorical |         356 |    569.39 MB
          <class 'pandas.core.indexes.base.Index |         380 |    481.21 MB

Do you have any suggestions on how to find the memory leak?

edit

Note: manually calling gc.collect() on each iteration does not help.
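
For reference, this is a sketch of the loop above with the explicit collection added (`some_function` is still the placeholder from the first snippet):

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
    # force a full collection at the end of every iteration;
    # in my case this does not reduce the reported memory usage
    gc.collect()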

edit 2

A minimal sample is here:

import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
for i in range(10):
    print(i)
    something = df.copy()
    print('#########################')
    print('trying to limit memory pressure')
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    summary.print_(sum1)
    print('#########################')

As you can see, this logs an increase in memory consumption: starting at about 9 MB, after 10 iterations it is already using 30 MB.

edit 3

Actually, the comment from @Steven might have a point:

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

shows the problem, whereas

for i in range(10):
    something = df.copy()
    summary.print_(summary.summarize(muppy.get_objects()))

works fine. How can I find all the variables that cause such problems? I think this is especially important because in my real code some of these are fairly large pandas.DataFrames.

edit 4

When manually adding a line foo_thing = None, the other script works fine as well. The question remains: how can I efficiently find all such cases? Shouldn't Python identify the no-longer-used variable automatically?
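
For clarity, this is the variant from edit 3 with that line added (just a sketch of what I mean):

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    # rebind the name so nothing refers to the summary any more
    foo_thing = None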

edit 5

When introducing a function like:

def do_some_stuff():
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

for i in range(10):
    something = df.copy()
    do_some_stuff()

the memory leak also seems to be fixed.

edit 6

Actually, the memory leak is not fixed. The good thing is that the summary no longer reports drastically increasing memory consumption. The bad thing is that the task manager/activity monitor tells me otherwise, and the Python program crashes at some point.
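
For the record, one way to cross-check the pympler numbers against the task manager (assuming the psutil package is available) would be to print the resident set size of the process on every iteration:

import os
import psutil

process = psutil.Process(os.getpid())

for i in range(10):
    # df and do_some_stuff() as defined in edit 5
    something = df.copy()
    do_some_stuff()
    # memory as seen by the operating system, in MB
    print(process.memory_info().rss / 1024 ** 2)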

  • I am unsure if this is helpful: https://stackoverflow.com/questions/14224068/memory-leak-using-pandas-dataframe – Georg Heiler Jul 24 '17 at 18:06
  • do you see the same behavior if you put the loop contents inside a function? – Aaron Jul 24 '17 at 18:10
  • Maybe I could manually free all memory occupied by pandas and reload a fresh copy of `original_df` from disk? – Georg Heiler Jul 24 '17 at 18:10
  • Have you checked whether `some_function(df)` has a side effect of creating a persistent reference to `df` or is leaking in some other way? – Steven Rumbalski Jul 24 '17 at 18:12
  • You mean to a global variable? But shouldn't this be overwritten on each run of the function? I am not aware of such a case. – Georg Heiler Jul 24 '17 at 18:13
  • @StevenRumbalski could you please explain why the variables are only properly dereferenced in that case? – Georg Heiler Jul 24 '17 at 20:09
  • Your memory leak is from checking for memory leaks; see this [link](https://stackoverflow.com/questions/26554102/memory-leak-in-adding-list-values), assuming you've been profiling like you posted [here](https://gist.github.com/geoHeil/ae3c235595ff3adb3ad73407eab5ad53) – DJK Aug 08 '17 at 01:02
  • Confirmed. That is correct. – Georg Heiler Aug 08 '17 at 04:27
  • Still, there is something more wrong with the real code. But for the minimal sample, and let's say half of the memory leak, what you posted is correct. Please create an answer. – Georg Heiler Aug 08 '17 at 04:29

3 Answers


The problem is with scoping. When you create a new object in the loop, it is supposed to be accessible after the loop ends. This is why (I assume) the garbage collector doesn't mark the objects created using copy for garbage collection. When you create new objects inside a function, those objects are limited to the function scope and are NOT supposed to be available outside the function. That is why they are collected.
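
To illustrate, here is a minimal sketch (reusing the DataFrame from your sample, not code from the question itself): the name bound in the loop survives after the loop ends, while a name bound inside a function does not:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

for i in range(10):
    something = df.copy()

# 'something' still refers to the last copy here,
# so that copy stays alive after the loop
print('something' in dir())   # True

def make_copy():
    local_copy = df.copy()
    # 'local_copy' disappears when the function returns,
    # so its copy becomes eligible for collection

make_copy()
print('local_copy' in dir())  # False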

You mentioned that assigning foo_thing = None solves the problem. It does so because, by pointing foo_thing to another object (None), there is no longer a variable that refers to the data frame. I use a similar approach, but instead of foo_thing = None I do del foo_thing. After all, Explicit is better than implicit.
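
Applied to the loop from your edit 3, that would be (just a sketch):

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    # remove the name explicitly so the summary can be collected
    del foo_thing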

Boris Gorelik

I've used the minimal sample and modified it slightly, using tracker from Pympler to see the difference after executing a number of iterations, but even after 10,000 iterations I can't see any memory leak.

This was tested with Python 3.6.0, Numpy 1.13.1 and Pandas 0.20.3.

So either the minimal sample you provided does not replicate the issue, or the issue is version-dependent.

import pandas as pd
from numpy.random import randn
from pympler import tracker
from tqdm import tqdm_notebook


df = pd.DataFrame(randn(10000,3),columns=list('ABC'))

tr_initial = tracker.SummaryTracker()

for i in tqdm_notebook(range(10000)):
    something = df.copy()

tr_initial.print_diff()  

Output:

                                                 types |   # objects |   total size
====================================================== | =========== | ============
                                          <class 'dict |          78 |     28.73 KB
                                          <class 'list |          36 |      4.59 KB
                <class 'traitlets.config.loader.Config |          17 |      4.25 KB
                                         <class 'bytes |          22 |      2.65 KB
                                           <class 'str |           9 |    771     B
                                          <class 'cell |          15 |    720     B
                                         <class 'tuple |          11 |    704     B
                                   function (<lambda>) |           4 |    544     B
                                        <class 'method |           7 |    448     B
                                          <class 'code |           3 |    432     B
                      <class 'ipykernel.comm.comm.Comm |           7 |    392     B
  <class 'ipywidgets.widgets.widget.CallbackDispatcher |           3 |    168     B
       <class 'ipywidgets.widgets.widget_layout.Layout |           3 |    168     B
                                 function (store_info) |           1 |    136     B
                               function (null_wrapper) |           1 |    136     B
DocZerø

Instead of making copies, I would iterate over the groupby. Does that fix your problem?

for month, df in original_df.groupby('month'):
    result = some_function(df)
    print(result)
jondo