
I'm doing a simple group by operation, trying to compare group means. As you can see below, I have selected specific columns from a larger dataframe, from which all missing values have been removed.

[screenshot: selected columns and df head]

But when I group by, I am losing a couple of columns:

[screenshot: group-by logic and resulting df]

I have never encountered this with pandas, and I'm not finding anything similar on Stack Overflow. Does anybody have any insight?
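A minimal sketch that reproduces the same symptom (column names are invented here, not my real ones):

import pandas as pd

# hypothetical data: 'metric_b' looks numeric but is stored as strings (object dtype)
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'metric_a': [1.0, 2.0, 3.0, 4.0],
                   'metric_b': ['1.5', '2.5', '3.5', '4.5']})

# on the pandas versions current at the time, the non-numeric column
# silently disappears from the result
print(df.groupby('category').mean())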


3 Answers


I think this is the automatic exclusion of 'nuisance' columns, as described here.

Sample:

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [-0.919854, -0.042379, 1.247642, -0.009920, 0.290213, 0.495767, 0.362949, 1.548106],
                   'D': [-1.131345, -0.089329, 0.337863, -0.945867, -0.932132, 1.956030, 0.017587, -0.016692]})
print(df)
     A      B         C         D
0  foo    one -0.919854 -1.131345
1  bar    one -0.042379 -0.089329
2  foo    two  1.247642  0.337863
3  bar  three -0.009920 -0.945867
4  foo    two  0.290213 -0.932132
5  bar    two  0.495767  1.956030
6  foo    one  0.362949  0.017587
7  foo  three  1.548106 -0.016692

print(df.groupby('A').mean())
            C         D
A                      
bar  0.147823  0.306945
foo  0.505811 -0.344944

I think you can check DataFrame.dtypes to see which columns are non-numeric.
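For example, with the sample frame above:

print(df.dtypes)
# A     object
# B     object
# C    float64
# D    float64
# dtype: object

# only the numeric columns are aggregated by .mean();
# the object column B is dropped as a 'nuisance' column
print(df.select_dtypes(include='number').columns.tolist())  # ['C', 'D']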

jezrael
  • @jezrael Is there a way to stop this, so the old headers are still up top? – spacedustpi Jul 27 '18 at 13:50
  • @spacedustpi - What about `df.groupby('A', as_index=False).mean()` or `df.groupby('A').mean().reset_index()`? – jezrael Jul 27 '18 at 13:52
  • This is my code: `df1 = df.groupby(['ORGNTR_NM', 'ORGNTR_BNK_NM', 'BNFCRY_BNK_NM', 'BNFCRY_NM'], as_index=False)['TRNSXN_AMT'].agg(['sum', 'count'])`. The result is that these four original columns are no longer columns, while 'sum' and 'count' are, and they sit above the old columns. I would like them all to be headers. – spacedustpi Jul 27 '18 at 14:00
  • @spacedustpi - Does `df1 = df.groupby(['ORGNTR_NM', 'ORGNTR_BNK_NM', 'BNFCRY_BNK_NM', 'BNFCRY_NM'], as_index=False)['TRNSXN_AMT'].agg(['sum', 'count']).reset_index()` work? – jezrael Jul 27 '18 at 14:01
  • Please see original comment, I edited it. Sorry about that. – spacedustpi Jul 27 '18 at 14:03
  • That did it! Thanks. – spacedustpi Jul 27 '18 at 14:10
  • I'm finding this issue when I run `df.groupby(mycols, as_index=False).agg(lambda x: ";".join(pd.unique(x)))`. My "disappearing" column has a lot of `NA`, but otherwise strings. I find it strange that this behavior occurs here, but I've also been unlucky in getting a decent working example – Sos Jul 26 '19 at 06:45
  • Also, I want the `NA` to be dropped, just not the rows that also have values – Sos Jul 26 '19 at 06:50
  • That would've been easier indeed. I ended up doing `df.mycols.fillna('')`, grouping, and then `df.problem_col.replace('^;|;$','')` – Sos Jul 26 '19 at 08:41
  • Aagh. I had a stray non-numeric character deep in a CSV file, so Pandas "naturally" decided to silently fail to compute `.sum()` of that column and instead just deleted the column. I'm astonished that this is the default behaviour. – Tunneller May 08 '22 at 23:44
  • Pandas will not perform numerical groupby operations on `object` columns. You can check the column datatype with [`DataFrame.dtypes`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html). When pulling data from a SQL database, sometimes the null values come across as `None`, so you will need to replace the `None` values with `np.nan` using `df.fillna(np.nan)` or equivalent. – Foggy Jul 28 '22 at 16:50
  • For anyone running into this: after reading floats from a csv, pandas interpreted one column as object rather than float. I checked with df.dtypes and converted this one column to numeric `df.somecol = pd.to_numeric(df.somecol)` after that everything worked as expected. No other solution here worked for me. – squarespiral Nov 26 '22 at 12:45
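For reference, a rough sketch of the workaround Sos describes in the comments above (column names are hypothetical; the NaNs are filled first so the string column is not dropped, and the stray separators are stripped afterwards):

import pandas as pd

# hypothetical data: 'problem_col' holds strings with some missing values
df = pd.DataFrame({'key': ['x', 'x', 'y', 'y'],
                   'problem_col': ['a', None, 'b', 'c']})

# fill the NaNs so the column keeps its values, then join the unique strings per group
out = (df.fillna({'problem_col': ''})
         .groupby('key', as_index=False)
         .agg(lambda x: ';'.join(pd.unique(x))))

# strip the leading/trailing ';' left behind by the empty strings
out['problem_col'] = out['problem_col'].str.replace('^;|;$', '', regex=True)
print(out)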

Try df.groupby(['col_1', 'col_2'], as_index=False).mean(). Use as_index=False to keep the grouping columns as regular columns instead of moving them into the index (the default is True). The comments above have already answered this question, but I'm posting it as an answer.
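For example, with the sample frame from the answer above (newer pandas versions may also need numeric_only=True here):

# default: the group keys become the index
print(df.groupby('A').mean())
#             C         D
# A
# bar  0.147823  0.306945
# foo  0.505811 -0.344944

# as_index=False: the group keys stay as an ordinary column
print(df.groupby('A', as_index=False).mean())
#      A         C         D
# 0  bar  0.147823  0.306945
# 1  foo  0.505811 -0.344944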

Digvijay Sawant

Make sure your column has a numeric dtype (int or float) and not 'O' (object). This was the reason the column was disappearing for me.

You can check the dtype of the column with the code below:

df.column.dtypes
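If the column turns out to be object but actually holds numbers, a small sketch of one possible fix (column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': ['1.5', '2.5', '3.0']})
print(df['val'].dtype)  # object, so .mean() would exclude it

# convert to a numeric dtype; unparseable entries become NaN
df['val'] = pd.to_numeric(df['val'], errors='coerce')
print(df.groupby('grp')['val'].mean())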
Sami Navesi