Pandas: remove old DataFrame from memory after groupby

Question

  value  Group  something
0     a    1          1
1     b    1          2
2     c    1          4
3     c    2          9
4     b    2         10
5     x    2          5
6     d    2          3
7     e    3          5
8     d    2         10
9     a    3          5

I want to select the last 3 rows of each group (from the df above), as shown below, but perform the operation in place. I want to ensure that only the new df object is kept in memory after the assignment. What would be an efficient way of doing it?

df = df.groupby('Group').tail(3)

The result should look like the following:

  value  Group  something
0     a    1          1
1     b    1          2
2     c    1          4
5     x    2          5
6     d    2          3
7     e    3          5
8     d    2         10
9     a    3          5

N.B.: This question is related to Keeping the last N duplicates in pandas

Why do you not want to use the code you give as an example (`df = df.groupby('Group').tail(3)`)? You can't do an in-place groupby, as the grouped dataframe is a fundamentally different object. – johnpaton Nov 26 '18 at 17:06
@johnpaton I edited the post a little bit. My goal is to ensure that I am keeping only the new df object in memory after assignment. – gibbz00 Nov 26 '18 at 17:10
@gibbz00 that happens with the current formulation as well. Python's garbage collection will take care of the old one once there are no more active references to it. – johnpaton Nov 26 '18 at 17:13
@W-B Thank you. That answers the question. Can you kindly post that as an answer? – gibbz00 Nov 26 '18 at 17:13
@johnpaton Thank you. I did not know it's automatically taken care of once all the active references are gone. Can you give an example of an active reference that will make the old df linger? – gibbz00 Nov 26 '18 at 17:15
@gibbz00 giving the output df a new name (`df_grouped = df.groupby('Group').tail(3)`) would mean that `df` still references the old dataframe, whereas `df_grouped` references the new one. Now they will both be stored in memory. – johnpaton Nov 26 '18 at 17:16
Better to assign to a new column name, like `df['new_col'] = df.groupby('Group').tail(3)`, if you don't want to overwrite? – Karn Kumar Nov 26 '18 at 17:17

score 1 · Accepted Answer · answered Nov 26 '18 at 17:14

`df = df.groupby('Group').tail(3)` is already an efficient way of doing it. Because you are rebinding the name `df`, Python will release the memory of the old dataframe once nothing else references it, and you will only have access to the new one.
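
The reference behaviour described here can be observed with a weak reference. This is a minimal sketch, assuming CPython and the standard-library `weakref` and `gc` modules; `old_ref`, `alias`, and the two flag variables are illustrative names that do not appear in the question.

```python
import gc
import weakref

import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "value": list("abccbxdeda"),
    "Group": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "something": [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

# A weak reference observes the original frame without keeping it alive.
old_ref = weakref.ref(df)

# A second name bound to the same object is an extra active reference.
alias = df

# Rebinding `df` leaves `alias` pointing at the original frame.
df = df.groupby("Group").tail(3)
alive_with_alias = old_ref() is not None

# Remove the last strong reference; gc.collect() also clears reference cycles.
del alias
gc.collect()
alive_after_del = old_ref() is not None

print(alive_with_alias, alive_after_del)
```

While `alias` exists, the original frame stays alive. Once it is deleted, the second flag is expected to be `False` on CPython, although when the freed memory is returned to the operating system depends on the allocator.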

johnpaton

score 1 · Answer 2 · piRSquared · 2018-11-26T17:23:18.130

Trying way too hard to guess what you want.

NOTE: using the pandas `inplace` argument, where it is available, is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.

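from collections import defaultdict

def f(s):
    """Yield index labels of rows that are not among the last 3 of their group."""
    c = defaultdict(int)  # rows seen so far per group value
    # Walk the Series backwards so the last rows of each group are counted first.
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

# Drop the surplus rows from the existing DataFrame object.
df.drop([*f(df.Group)], inplace=True)
df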

  value  Group  something
0     a      1          1
1     b      1          2
2     c      1          4
5     x      2          5
6     d      2          3
7     e      3          5
8     d      2         10
9     a      3          5
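
The same rows can also be found without a Python-level loop. The following is a sketch of an alternative, not the method used in this answer: it assumes `GroupBy.cumcount(ascending=False)` to number each row from the end of its group, and `pos_from_end` is an illustrative name. The caveat about `inplace` in the note above still applies.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "value": list("abccbxdeda"),
    "Group": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "something": [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

# Position of each row counted from the end of its group: 0 is the last row.
pos_from_end = df.groupby("Group").cumcount(ascending=False)

# Rows at position 3 or more are not among the last three of their group.
df.drop(df.index[pos_from_end >= 3], inplace=True)

print(df)
```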

edited Nov 26 '18 at 17:23

answered Nov 26 '18 at 17:19

I was imagining a solution like this one, as drop has an `inplace` parameter. However, I did not know that `df = df.groupby('Group').tail(3)` already ensures that the old df is released from memory once overwritten. – gibbz00 Nov 26 '18 at 17:22
Yeah, if you aren't concerned with the temporary memory being consumed then released, then you should absolutely use `df.groupby('Group').tail(3)`. You didn't mention performance so I assume it isn't an issue. – piRSquared Nov 26 '18 at 17:24
What is the asterisk (*) doing in `[*f(df.Group)]`? – gibbz00 Nov 26 '18 at 17:29
`*` in that context unpacks the iterable. `[*f(df.Group)]` is a fancy way of writing `list(f(df.Group))` – piRSquared Nov 26 '18 at 17:31
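
A small illustration of that equivalence, using a hypothetical generator `gen` in place of `f(df.Group)`; unpacking inside a list display requires Python 3.5 or later.

```python
def gen():
    """Hypothetical generator standing in for f(df.Group)."""
    yield from (3, 4)

# Unpacking the generator inside a list display...
unpacked = [*gen()]

# ...builds the same list as passing it to list().
listed = list(gen())

print(unpacked == listed)
```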

score 1 · Answer 3 · answered Nov 26 '18 at 17:30

Your question already contains the answer. However, as mentioned earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new column, as below:

df['new_col'] = df.groupby('Group')['something'].tail(3)   #  select one column so that a Series is assigned

However, out of curiosity, if you are not concerned about the groupby and only want the last N rows of the df, you can do it as below:

df[-2:]   #  last 2 rows
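
For comparison, a short sketch of three spellings that should select the same last two rows of the example frame, regardless of group; the variable names are illustrative. `iloc` and `tail` make the positional intent explicit.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "value": list("abccbxdeda"),
    "Group": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "something": [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

last_two_slice = df[-2:]      # plain slice on the frame
last_two_iloc = df.iloc[-2:]  # explicit positional indexing
last_two_tail = df.tail(2)    # dedicated method

print(last_two_tail)
```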
