21

Is there any way I can retain the original index of my large DataFrame after I perform a groupby? The reason I need to do this is that I need to do an inner merge back into my original df (after the groupby) to regain the columns lost in the aggregation, and the index is the only unique key available for that merge. Does anyone know how I can achieve this?

My DataFrame is quite large. My groupby looks like this:

df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()

This drops the original index of my dataframe, which I want to keep.

jpp
Hana
  • When you group data, what index do you want each row to have? It's likely each group will combine many rows in the original dataframe. Do you, for example, expect a list of indices relating to the group? – jpp Mar 11 '18 at 03:37
  • 1
    yes! that is what I'm looking for – Hana Mar 11 '18 at 03:40

4 Answers

21

You can elevate your index to a column via reset_index. Then aggregate the index column into a tuple via agg, alongside your count aggregation.

Below is a minimal example.

import pandas as pd, numpy as np

df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
                  index=np.random.randint(0, 4, 50))

df = df.reset_index()

res = df.groupby([0, 1]).agg({2: 'count', 'index': tuple}).reset_index()

#     0  1  2            index
# 0   0  0  4     (2, 0, 0, 2)
# 1   0  1  4     (0, 3, 1, 1)
# 2   0  2  1             (1,)
# 3   0  3  1             (3,)
# 4   1  0  4     (1, 2, 1, 3)
# 5   1  1  2           (1, 3)
# 6   1  2  4     (2, 1, 2, 2)
# 7   1  3  1             (2,)
# 8   2  0  5  (0, 3, 0, 2, 2)
# 9   2  1  2           (0, 2)
# 10  2  2  5  (1, 1, 3, 3, 2)
# 11  2  3  2           (0, 1)
# 12  3  0  4     (0, 3, 3, 3)
# 13  3  1  4     (1, 3, 0, 1)
# 14  3  2  3        (3, 2, 1)
# 15  3  3  4     (3, 3, 2, 1)
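Not part of the original answer: a sketch of the merge-back step the question asks about, reusing the setup above (seeded here so the run is reproducible).

```python
import numpy as np
import pandas as pd

# Same setup as the answer above, seeded for reproducibility.
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
                  index=np.random.randint(0, 4, 50))
df = df.reset_index()

res = df.groupby([0, 1]).agg({2: 'count', 'index': tuple}).reset_index()

# Merge on the group keys: each original row picks up its group's
# count plus the tuple of original indices that fed into that group.
counts = res.rename(columns={2: 'count', 'index': 'orig_indices'})
merged = df.merge(counts, on=[0, 1], how='left')
```

Every row of `merged` now carries both its original index (the `index` column) and the aggregated result, which is the inner-merge-back the question describes.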
jpp
  • 3
    As I understand the OP's question.. this is the correct answer. – mortysporty Aug 21 '19 at 06:57
  • 1
    Correct answer indeed! – Gonzalo Garcia Sep 09 '19 at 17:38
  • `'index': tuple` also works for the second dictionary item. – Attila the Fun Jul 01 '22 at 19:23
  • 1
    For clarity: `.agg({'index': tuple})` works on the `'index'` column of `df` (a column after `df.reset_index()`), and puts the *values* collected during the grouping into a tuple in the resulting `'index'` column of the `res` DataFrame – PiWi Oct 25 '22 at 16:43
  • Depending on your DataFrame size, it may be worth using `set` instead of `tuple` since indices are unique anyway and their order usually doesn't matter (contrary to the example used in this answer) – YPOC Jul 26 '23 at 16:32
14

I think you are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
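For illustration, a minimal sketch with hypothetical data (the `col1`/`col2`/`col3` names are assumed from the question) showing that transform leaves the original index intact:

```python
import pandas as pd

# Hypothetical data matching the column names in the question.
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b'],
                   'col2': [1, 1, 1, 2, 2],
                   'col3': [10, 20, 30, 40, 50]},
                  index=[101, 102, 103, 104, 105])

# transform broadcasts the per-group count back to every row,
# so no rows are collapsed and the index is untouched.
df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
# df['count'] -> [2, 2, 1, 2, 2]
# df.index    -> [101, 102, 103, 104, 105]  (unchanged)
```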
Scott Boston
  • 2
    according to the comments, he wants to know which indices contributed to each group – jpp Mar 11 '18 at 04:24
  • This seems to be the most optimal solution out there as of Pandas 0.25.1 – DACW Aug 28 '19 at 19:17
  • is there a simple solution like this with performance similar to losing the index without transform? i'm using `groups.last()` and `groups.transform('last')` is much slower. pandas 1.3.4 – Jayen Oct 21 '21 at 10:55
  • @jayen I don't understand your question here. – Scott Boston Oct 21 '21 at 12:24
  • `groups.transform('last')` is slow. `groups.last()` is fast. is there a fast way to use groupby and keep the index? i managed to do it by copying the index to a column and then restoring it after the groupby. almost as fast as `groups.last()` – Jayen Oct 21 '21 at 14:02
0

You should not use `reset_index()` if you want to keep your original indexes.

manoj
  • 3
    that doesn't work, even if the reset_index() is not there, the groupby does not retain the original indexes – Hana Mar 11 '18 at 03:35
  • 1
    you are correct, it won't solve the problem. My bad. Let me see if I can find any solution. – manoj Mar 11 '18 at 04:00
0

To actually get an original index back, you need to do

df['orig_index'] = df.groupby(['col1', 'col2'])['col3'].transform('idxmin')  # idxmin for the first occurrence (group minimum), idxmax for the last

N.B. If your agg column is a datetime, you may get dates instead of the integer index (reference); this is an issue with older versions of pandas.

Work around it by converting the datetime column to an integer first:

df['date_col'] = df['date_col'].apply(lambda x: x.toordinal())
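A small sketch of the idxmin approach with hypothetical data (column names and values assumed): each row receives the index label of the row holding its group's minimum `col3`.

```python
import pandas as pd

# Hypothetical data; index labels deliberately differ from the values
# so the result is unambiguous.
df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': [1, 1, 2],
                   'col3': [5, 3, 7]},
                  index=[10, 11, 12])

# Broadcasts, per group, the index label of the minimum col3 value.
df['orig_idx'] = df.groupby(['col1', 'col2'])['col3'].transform('idxmin')
# orig_idx -> [11, 11, 12]
```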
Chidi