I have a program which repeatedly loops over a pandas DataFrame like below:
months = [some months]
for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)
However, the memory required per iteration keeps increasing:
                                           types |   # objects |   total size
================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          22 |      6.54 GB
               <class 'pandas.core.series.Series |        1198 |      4.72 GB
                           <class 'numpy.ndarray |        1707 |    648.19 MB
     <class 'pandas.core.categorical.Categorical |         238 |    368.90 MB
          <class 'pandas.core.indexes.base.Index |         256 |    312.03 MB
================================================ | =========== | ============
             <class 'pandas.core.frame.DataFrame |          30 |      9.04 GB
               <class 'pandas.core.series.Series |        2262 |      7.29 GB
                           <class 'numpy.ndarray |        2958 |    834.49 MB
     <class 'pandas.core.categorical.Categorical |         356 |    569.39 MB
          <class 'pandas.core.indexes.base.Index |         380 |    481.21 MB
Do you have any suggestions on how to find the memory leak?
edit
Note: manually calling gc.collect() on each iteration does not help.
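For reference, this is roughly what the loop looks like with the manual collection added (a sketch based on the pseudocode above; original_df and some_function are the placeholders from there):

import gc

for month in months:
    df = original_df[original_df.month == month].copy()
    result = some_function(df)
    print(result)

    # force a full collection and check whether anything ends up uncollectable
    n_unreachable = gc.collect()
    print('unreachable objects found:', n_unreachable)
    print('uncollectable objects (gc.garbage):', len(gc.garbage))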
edit 2
A minimal sample is here:
import pandas as pd
from numpy.random import randn
from pympler import muppy, summary

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))
for i in range(10):
    print(i)
    something = df.copy()
    print('#########################')
    print('trying to limit memory pressure')
    all_objects = muppy.get_objects()
    sum1 = summary.summarize(all_objects)
    summary.print_(sum1)
    print('#########################')
As you can see, this logs an increase in memory consumption: starting with about 9 MB, after 10 iterations it is already using 30 MB.
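As an additional way of measuring, the same loop could be instrumented with Python's built-in tracemalloc module to compare allocation snapshots between iterations (a sketch, not part of my original script):

import tracemalloc

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

tracemalloc.start()
previous = tracemalloc.take_snapshot()
for i in range(10):
    something = df.copy()
    current = tracemalloc.take_snapshot()
    # print the allocation sites that grew the most since the last iteration
    for stat in current.compare_to(previous, 'lineno')[:5]:
        print(stat)
    previous = current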
edit 3
Actually, the comment from @Steven might have a point:
for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
shows the problem, whereas
for i in range(10):
    something = df.copy()
    summary.print_(summary.summarize(muppy.get_objects()))
works fine. How can I find all of the variables that cause such problems? I think this is especially important, as in my real code some of these are fairly large pandas.DataFrames.
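Along those lines, pympler's SummaryTracker can print only the difference between two calls, which might make the growing objects easier to spot (a sketch, using the same minimal example as above):

from pympler import tracker

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 3), columns=list('ABC'))

tr = tracker.SummaryTracker()
for i in range(10):
    something = df.copy()
    # prints only the objects created/destroyed since the previous call
    tr.print_diff()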
edit 4
When manually adding a line foo_thing = None, the other script works fine as well. The question remains: how can I efficiently find all such cases? Shouldn't Python identify the no-longer-used variable automatically?
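For completeness, the working variant looks roughly like this (explicitly dropping the reference before the next iteration; del foo_thing has the same effect):

for i in range(10):
    something = df.copy()
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)
    foo_thing = None  # drop the reference to the summary before the next iteration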
edit 5
When introducing a function like:
def do_some_stuff():
    foo_thing = summary.summarize(muppy.get_objects())
    summary.print_(foo_thing)

for i in range(10):
    something = df.copy()
    do_some_stuff()
the memory leak also seems to be fixed.
edit 6
Actually, the memory leak is not fixed. The good thing is that the summary now no longer reports drastically increasing memory consumption. The bad thing is that the task manager/activity monitor tells me otherwise, and the Python program crashes at some point.
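To cross-check what the task manager reports, the resident memory of the process could be logged per iteration, e.g. with psutil (a sketch, assuming psutil is installed and df from the minimal example above; not part of my original script):

import os

import psutil

process = psutil.Process(os.getpid())
for i in range(10):
    something = df.copy()
    # resident set size as the operating system sees it, in MB
    print('RSS: %.1f MB' % (process.memory_info().rss / 1024 ** 2))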