
I am removing rows from a fairly large data frame using the following code.

try:
    df = df[~df['Full'].str.contains(myregex, regex=True, case=False)]
    return df

However, instead of the data frame shrinking in memory on each iteration (large amounts of data are removed each time), the task manager shows memory usage going up.

Before filtering starts, Python uses ~4 GB of memory, but after the 22nd filtering event it uses ~22 GB of RAM.

Is there a way to remove matching entries from the data frame in a more efficient manner?

Edit: I use regex and contains; I can't change that.

simplex123
  • You could also try processing the file in chunks, e.g. https://stackoverflow.com/a/25962187/1358308; just accumulate the data you want and discard everything else. – Sam Mason Aug 21 '19 at 11:17
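
A minimal sketch of the chunked approach suggested in the comment above, assuming the data comes from a CSV file; the file name, chunk size, and pattern here are placeholders, not taken from the original post:

import pandas as pd

myregex = r"foo|bar"  # placeholder pattern, supplied elsewhere in the real code

kept = []
for chunk in pd.read_csv("big_file.csv", chunksize=500_000):
    # drop the matching rows from each chunk before accumulating it,
    # so the full unfiltered frame never sits in memory at once
    mask = chunk['Full'].str.contains(myregex, regex=True, case=False, na=False)
    kept.append(chunk[~mask])

df = pd.concat(kept, ignore_index=True)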

2 Answers


You could try calling gc.collect() after each filtering event. Normally a collection is triggered after a certain amount of allocations and de-allocations. But if you only perform a small number of huge de-allocations you might want to trigger it manually.

Python itself doesn't seem to release memory back to the OS, but numpy (on which pandas is based) does.

Also check the rest of your code to make sure you are not keeping references to the original dataframe or its columns somewhere else. Python only de-allocates an object once its reference count reaches 0.
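
A minimal sketch of this suggestion; the function name and signature are assumptions for illustration:

import gc

def drop_matching_rows(df, myregex):
    # keep only the rows that do NOT match the pattern
    df = df[~df['Full'].str.contains(myregex, regex=True, case=False)]
    # trigger a collection so the memory held by the discarded rows
    # is released right away instead of waiting for the automatic collector
    gc.collect()
    return df

# The caller must rebind its own name as well, otherwise the old frame
# is still referenced and cannot be freed:
# df = drop_matching_rows(df, myregex)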

Roland Smith

Why do you use try and return?

Follow this post: How to filter rows containing a string pattern from a Pandas dataframe

df = df[~df['Full'].str.contains(mystr)]

The post also shows other ways to filter your dataframe.
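
For example, one common variant is to build the boolean mask once and then select with .loc; this sketch keeps the regex=True / case=False options the asker needs and is an illustration, not taken verbatim from the linked post:

# build the mask once, then keep the non-matching rows
# (na=False treats missing values as non-matches)
mask = df['Full'].str.contains(myregex, regex=True, case=False, na=False)
df = df.loc[~mask]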

PV8
  • Because I use it in a function and do some checks on the regex beforehand. Also, please note the use of `regex=True` in my post. – simplex123 Aug 21 '19 at 08:47